ICS 278: Data Mining
Lecture 14: Document Clustering and Topic Extraction
Padhraic Smyth
Department of Information and Computer Science, University of California, Irvine
Text Mining
– Information Retrieval
– Text Classification
– Text Clustering
– Information Extraction
Document Clustering
Set of documents D in term-vector form
– no class labels this time
– want to group the documents into K groups or into a taxonomy
– each cluster hypothetically corresponds to a “topic”
Methods:
– any of the well-known clustering methods
– k-means, e.g., “spherical k-means”: normalize document vectors so distances are cosine-like (see the sketch below)
– hierarchical clustering
– probabilistic model-based clustering, e.g., mixtures of multinomials
Single-topic versus multiple-topic models
– extensions to author-topic models
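The “spherical k-means” bullet above can be made concrete in a few lines of scikit-learn: TF-IDF document vectors are normalized to unit length, so ordinary k-means effectively clusters by cosine similarity. This is a minimal sketch under assumed choices (toy corpus, K = 2, English stop-word removal), not code from the lecture.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

docs = ["the hard drive failed again",          # toy corpus (assumption)
        "install the scsi card and format the disk",
        "faith, the bible, and christian belief",
        "god and religious claims"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
X = normalize(X)                                # unit-length rows: "spherical" k-means
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)                                   # cluster index for each document
```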
Mixture Model Clustering
Conditional independence model for each component (often quite useful to first order)
Mixtures of Documents
[Figure: document-term matrix with rows grouped into Component 1 and Component 2]
[Figure: document-term matrix; the component labels C1, C2 are unobserved and treated as missing]
E-Step: estimate component membership probabilities P(C1 | x), P(C2 | x) for each document, given current parameter estimates (the missing component labels)
M-Step: use the “fractionally” weighted data to get new estimates of the parameters
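A compact sketch of these E and M steps for a mixture of multinomials, with the documents given as a word-count matrix. The initialization, smoothing constant, and iteration count are illustrative assumptions.

```python
import numpy as np

def em_multinomial_mixture(X, K, n_iter=50, alpha=1e-2, seed=0):
    """EM for a K-component mixture of multinomials; X is (n_docs, n_terms) counts."""
    rng = np.random.default_rng(seed)
    n, V = X.shape
    pi = np.full(K, 1.0 / K)                      # mixture weights
    theta = rng.dirichlet(np.ones(V), size=K)     # p(term | component), shape (K, V)
    for _ in range(n_iter):
        # E-step: membership probabilities from current parameters
        log_r = np.log(pi) + X @ np.log(theta).T  # log p(c) + log p(doc | c), shape (n, K)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the fractionally weighted data
        pi = r.mean(axis=0)
        theta = r.T @ X + alpha                   # expected counts + smoothing
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta, r
```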
A Document Cluster
Most likely terms in Component 5 (weight = 0.08): write, drive, problem, mail, articl, hard, work, system, good, time
Highest-lift terms in Component 5 (lift = p(t|k) / p(t)): scsi, drive, hard, card, format, softwar, memori, install, disk, engin
(probability and lift values not shown)
Another Document Cluster
Most likely terms in Component 1 (weight = 0.11): articl, good, dai, fact, god, claim, apr, fbi, christian, group
Highest-lift terms in Component 1: fbi, jesu, fire, christian, evid, god, gun, faith, kill, bibl
(probability and lift values not shown)
A topic is represented as a (multinomial) distribution over words
Example topic #1: SPEECH 0.0691, RECOGNITION 0.0412, SPEAKER 0.0288, PHONEME 0.0224, CLASSIFICATION 0.0154, SPEAKERS 0.0140, FRAME 0.0135, PHONETIC 0.0119, PERFORMANCE 0.0111, ACOUSTIC 0.0099, BASED 0.0098, PHONEMES 0.0091, UTTERANCES 0.0091, SET 0.0089, LETTER 0.0088, …
Example topic #2: WORDS 0.0671, WORD 0.0557, USER 0.0230, DOCUMENTS 0.0205, TEXT 0.0195, RETRIEVAL 0.0152, INFORMATION 0.0144, DOCUMENT 0.0144, LARGE 0.0102, COLLECTION 0.0098, KNOWLEDGE 0.0087, MACHINE 0.0080, RELEVANT 0.0077, SEMANTIC 0.0076, SIMILARITY 0.0071, …
The basic model….
[Graphical model: a single cluster variable C with conditionally independent word variables X1, X2, …, Xd]
A better model….
[Graphical model: word variables X1, X2, …, Xd can each depend on several latent components A, B, C]
A better model….
[Same graphical model as above]
History:
– latent class models in statistics
– Hofmann applied the idea to documents (SIGIR ’99)
– recent extensions, e.g., Blei, Ng, Jordan (JMLR, 2003)
– variously known as factor / aspect / latent class models
A better model….
[Same graphical model as above]
Inference can be intractable due to undirected loops!
A better model for documents….
Multi-topic model
– a document is generated from multiple components
– multiple components can be active at once
– each component = a multinomial distribution
– parameter estimation is tricky
– very useful: “parses” documents into high-level semantic components
A generative model for documents
Topic 1, p(w | z = 1): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2 (all other words 0.0)
Topic 2, p(w | z = 2): SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2 (all other words 0.0)
Choose mixture weights θ = {P(z = 1), P(z = 2)} for each document, then generate a “bag of words”
Example weights: {0, 1}, {0.25, 0.75}, {0.5, 0.5}, {0.75, 0.25}, {1, 0}
[Figure: the five generated documents shift from all topic-2 words (MATHEMATICS KNOWLEDGE RESEARCH WORK SCIENTIFIC …) through mixed documents (MATHEMATICS HEART RESEARCH LOVE WORK TEARS SOUL KNOWLEDGE …) to all topic-1 words (SOUL TEARS LOVE JOY …)]
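A small simulation of this generative step, using the two five-word topics from the previous slide; the document length and random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["HEART", "LOVE", "SOUL", "TEARS", "JOY",
         "SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]
phi = np.array([[0.2] * 5 + [0.0] * 5,      # p(w | z = 1): "love" topic
                [0.0] * 5 + [0.2] * 5])     # p(w | z = 2): "science" topic

def generate_doc(theta, length=10):
    """For each token, draw a topic z ~ theta, then a word w ~ p(w | z)."""
    z = rng.choice(2, size=length, p=theta)
    return " ".join(vocab[rng.choice(len(vocab), p=phi[zi])] for zi in z)

for theta in ([0, 1], [0.25, 0.75], [0.5, 0.5], [0.75, 0.25], [1, 0]):
    print(theta, generate_doc(np.array(theta, dtype=float)))
```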
A visual example: Bars
pixel = word, image = document; sample each pixel from a mixture of topics
Interpretable decomposition
SVD gives a basis for the data, but not an interpretable one
The true basis is not orthogonal, so rotation does no good
[Figure: matrix-factorization view (Dumais, Landauer). SVD/LSA: the words × documents matrix ≈ U D V, with documents mapped onto latent “dims”. LDA: the word-document probabilities P(w) are expressed as a mixture over topics, i.e., a words × topics matrix P(w|z) times a topics × documents matrix P(z)]
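The contrast can be seen directly in scikit-learn: TruncatedSVD returns real-valued (possibly negative) component vectors, while LDA components can be normalized into proper word distributions p(w|z). The toy corpus and the choice of two components are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

docs = ["speech recognition phoneme acoustic speaker",
        "document retrieval text information relevant",
        "speaker acoustic phonetic utterance speech",
        "text document collection retrieval information"]
X = CountVectorizer().fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0).fit(X)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

p_w_given_z = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
print(svd.components_[0])    # an orthogonal "dimension": entries can be negative
print(p_w_given_z[0])        # a topic: a multinomial distribution over the vocabulary
```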
History of multi-topic models
Latent class models in statistics
Hofmann 1999
– original application to documents
Blei, Ng, and Jordan (2001, 2003)
– variational methods
Griffiths and Steyvers (2003)
– Gibbs sampling approach (very efficient)
A selection of topics (top words per topic)
– FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
– HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
– MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
– STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
– NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
– TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN
A selection of topics (top words per topic)
– PARASITE PARASITES FALCIPARUM MALARIA HOST PLASMODIUM ERYTHROCYTES ERYTHROCYTE MAJOR LEISHMANIA INFECTED BLOOD INFECTION MOSQUITO INVASION TRYPANOSOMA CRUZI BRUCEI HUMAN HOSTS
– ADULT DEVELOPMENT FETAL DAY DEVELOPMENTAL POSTNATAL EARLY DAYS NEONATAL LIFE DEVELOPING EMBRYONIC BIRTH NEWBORN MATERNAL PRESENT PERIOD ANIMALS NEUROGENESIS ADULTS
– CHROMOSOME REGION CHROMOSOMES KB MAP MAPPING CHROMOSOMAL HYBRIDIZATION ARTIFICIAL MAPPED PHYSICAL MAPS GENOMIC DNA LOCUS GENOME GENE HUMAN SITU CLONES
– MALE FEMALE MALES FEMALES SEX SEXUAL BEHAVIOR OFFSPRING REPRODUCTIVE MATING SOCIAL SPECIES REPRODUCTION FERTILITY TESTIS MATE GENETIC GERM CHOICE SRY
– STUDIES PREVIOUS SHOWN RESULTS RECENT PRESENT STUDY DEMONSTRATED INDICATE WORK SUGGEST SUGGESTED USING FINDINGS DEMONSTRATE REPORT INDICATED CONSISTENT REPORTS CONTRAST
– MECHANISM MECHANISMS UNDERSTOOD POORLY ACTION UNKNOWN REMAIN UNDERLYING MOLECULAR PS REMAINS SHOW RESPONSIBLE PROCESS SUGGEST UNCLEAR REPORT LEADING LARGELY KNOWN
– MODEL MODELS EXPERIMENTAL BASED PROPOSED DATA SIMPLE DYNAMICS PREDICTED EXPLAIN BEHAVIOR THEORETICAL ACCOUNT THEORY PREDICTS COMPUTER QUANTITATIVE PREDICTIONS CONSISTENT PARAMETERS
“Content” components versus “Boilerplate” components
Content topics:
– GROUP MULTICAST INTERNET PROTOCOL RELIABLE GROUPS PROTOCOLS IP TRANSPORT DRAFT
– DYNAMIC STRUCTURE STRUCTURES STATIC PAPER DYNAMICALLY PRESENT META CALLED RECURSIVE
– DISTRIBUTED COMPUTING SYSTEMS SYSTEM HETEROGENEOUS ENVIRONMENT PAPER SUPPORT ARCHITECTURE ENVIRONMENTS
Boilerplate topic:
– RESEARCH SUPPORTED PART GRANT SCIENCE FOUNDATION FL WORK NATIONAL NSF
More example topics (top words per topic)
– DIMENSIONAL POINTS SURFACE GEOMETRIC SURFACES MESH PLANE POINT GEOMETRY PLANAR
– RULES CLASSIFICATION RULE ACCURACY ATTRIBUTES INDUCTION CLASSIFIER SET ATTRIBUTE CLASSIFIERS
– ORDER TERMS PARTIAL HIGHER REDUCTION PAPER TERM ORDERING SHOW MAGNITUDE
– GRAPH PATH GRAPHS PATHS EDGE NUMBER CONNECTED DIRECTED NODES VERTICES
– INFORMATION TEXT RETRIEVAL SOURCES DOCUMENT DOCUMENTS RELEVANT CONTENT AUTOMATICALLY DIGITAL
– SYSTEM FILE OPERATING STORAGE DISK SYSTEMS KERNEL ACCESS MANAGEMENT UNIX
– PAPER CONDITIONS CONCEPT CONCEPTS DISCUSSED DEFINITION ISSUES PROPERTIES IMPORTANT EXAMPLES
– LANGUAGE PROGRAMMING LANGUAGES FUNCTIONAL SEMANTICS SEMANTIC NATURAL CONSTRUCTS GRAMMAR LISP
“Style” components
– MODEL MODELS MODELING QUALITATIVE COMPLEX QUANTITATIVE CAPTURE MODELED ACCURATELY REALISTIC
– PAPER APPROACHES PROPOSED CHANGE BELIEF ALTERNATIVE APPROACH ORIGINAL SHOW PROPOSE
– TYPE SPECIFICATION TYPES FORMAL VERIFICATION SPECIFICATIONS CHECKING SYSTEM PROPERTIES ABSTRACT
– KNOWLEDGE SYSTEM SYSTEMS BASE EXPERT ACQUISITION DOMAIN INTELLIGENT BASES BASED
Recent Results on Author-Topic Models
Can we model authors, given documents?
(more generally: build statistical profiles of entities given sparse observed data)
[Figure: authors linked to the words of their documents]
[Figure: authors linked to words through hidden topics]
Model = Author-Topic distributions + Topic-Word distributions
Parameters learned via Bayesian learning
“Topic Model”:
– a document can be generated from multiple topics
– Hofmann (SIGIR ’99); Blei, Ng, Jordan (JMLR, 2003)
[Figure: words linked to hidden topics]
[Figure: authors linked to words through hidden topics]
Model = Author-Topic distributions + Topic-Word distributions
NOTE: documents can be composed of multiple topics
The Author-Topic Model: Assumptions of the Generative Model
– Each author is associated with a topic mixture
– Each document is a mixture of topics
– With multiple authors, the document is a mixture of the coauthors’ topic mixtures
– Each word in a text is generated from one topic and one author (potentially different for each word)
Generative Process
Assume authors A1 and A2 collaborate and produce a paper
– A1 has its own multinomial topic distribution
– A2 has its own multinomial topic distribution
For each word in the paper:
1. Sample an author x (uniformly) from {A1, A2}
2. Sample a topic z from author x’s topic distribution
3. Sample a word w from the multinomial word distribution of topic z
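A direct simulation of these three steps. The vocabulary size, number of topics, and the theta/phi notation and values below are placeholders invented for illustration, not estimates from data.

```python
import numpy as np

rng = np.random.default_rng(0)
V, T = 8, 3                                     # toy vocabulary size and topic count
phi = rng.dirichlet(np.ones(V), size=T)         # topic-word distributions phi_z
theta = {"A1": rng.dirichlet(np.ones(T)),       # each author's topic distribution theta
         "A2": rng.dirichlet(np.ones(T))}

def generate_paper(coauthors, n_words=15):
    words = []
    for _ in range(n_words):
        x = rng.choice(coauthors)               # 1. sample an author uniformly
        z = rng.choice(T, p=theta[x])           # 2. sample a topic from theta_x
        w = rng.choice(V, p=phi[z])             # 3. sample a word from phi_z
        words.append(int(w))
    return words

print(generate_paper(["A1", "A2"]))             # word ids of the generated paper
```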
Graphical Model
1. Choose an author (from the set of co-authors)
2. Choose a topic
3. Choose a word
[Figure: plate diagram of the author-topic model]
Model Estimation
Estimate x and z by Gibbs sampling (assignments of each word to an author and a topic); see the sketch below
Estimation is efficient: linear in data size
Infer:
– Author-Topic distributions
– Topic-Word distributions
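For concreteness, a sketch of one collapsed Gibbs sweep in the spirit of this estimation step: each word's (author, topic) pair is resampled from its conditional distribution given all other assignments. The count-matrix layout, hyperparameters alpha and beta, and data structures are assumptions; this is not the implementation used for the results that follow.

```python
import numpy as np

def gibbs_sweep(docs, authors, z, x, CWT, CAT, alpha, beta, rng):
    """One collapsed Gibbs sweep for an author-topic model (sketch).
    docs[d]: word ids; authors[d]: author ids of doc d's co-authors;
    z[d][i], x[d][i]: current topic/author assignment of token i in doc d;
    CWT[w, t]: count of word w assigned to topic t; CAT[a, t]: author-topic counts."""
    V, T = CWT.shape
    for d, words in enumerate(docs):
        A = np.asarray(authors[d])
        for i, w in enumerate(words):
            t, a = z[d][i], x[d][i]
            CWT[w, t] -= 1; CAT[a, t] -= 1                 # remove current assignment
            p_w = (CWT[w] + beta) / (CWT.sum(axis=0) + V * beta)                      # (T,)
            p_a = (CAT[A] + alpha) / (CAT[A].sum(axis=1, keepdims=True) + T * alpha)  # (|A|, T)
            p = (p_a * p_w).ravel()                        # joint over (author, topic)
            k = rng.choice(p.size, p=p / p.sum())
            a, t = A[k // T], k % T
            x[d][i], z[d][i] = a, t
            CWT[w, t] += 1; CAT[a, t] += 1                 # record new assignment
    return z, x
```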
Data
– 1700 proceedings papers from NIPS (2000+ authors); NIPS = Neural Information Processing Systems
– 160,000 CiteSeer abstracts (85,000+ authors)
– Removed stop words; word order is irrelevant, just use word counts
Processing time:
– NIPS: 2000 Gibbs iterations, 12 hours on a PC workstation
– CiteSeer: 700 Gibbs iterations, 111 hours
Author Modeling Data Sets (Source / Documents / Unique Authors / Unique Words / Total Word Count)
CiteSeer: 163,389 / 85,465 / 30,… / … million
CORA: 13,643 / 11,427 / 11,… / … million
NIPS: 1,740 / 2,037 / 13,… / … million
Four example topics from CiteSeer (T=300)
Four more topics
Some likely topics per author (CiteSeer)
Author = Andrew McCallum, U Mass:
– Topic 1: classification, training, generalization, decision, data, …
– Topic 2: learning, machine, examples, reinforcement, inductive, …
– Topic 3: retrieval, text, document, information, content, …
Author = Hector Garcia-Molina, Stanford:
– Topic 1: query, index, data, join, processing, aggregate, …
– Topic 2: transaction, concurrency, copy, permission, distributed, …
– Topic 3: source, separation, paper, heterogeneous, merging, …
Author = Paul Cohen, USC/ISI:
– Topic 1: agent, multi, coordination, autonomous, intelligent, …
– Topic 2: planning, action, goal, world, execution, situation, …
– Topic 3: human, interaction, people, cognitive, social, natural, …
Four example topics from NIPS (T=100)
Four more topics
Stability of Topics
The indexing of topics is arbitrary across runs of the model (e.g., topic #1 is not the same topic across runs)
However:
– the majority of topics are stable over processing time
– the majority of topics can be aligned across runs
Topics appear to represent genuine structure in the data
Comparing NIPS topics from the same chain (t1 = 1000 and t2 = 2000)
[Figure: KL distance between topics at t1 = 1000 and re-ordered topics at t2 = 2000; best match KL = 0.54, worst match KL = 4.78]
Comparing NIPS topics and CiteSeer topics
[Figure: KL distance between NIPS topics and re-ordered CiteSeer topics; example matches at KL = 2.88, 4.48, 4.92, 5.0]
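A sketch of how such a comparison can be computed: a symmetrized KL distance between every pair of topic-word distributions, followed by a greedy matching to re-order the topics of the second run. The greedy matching is an illustrative stand-in; the exact matching procedure behind these figures is not specified on the slides.

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL distance between two topic-word distributions."""
    p, q = p + eps, q + eps
    return 0.5 * float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def align_topics(phi_a, phi_b):
    """Greedily match each topic of run A to its closest unused topic of run B."""
    D = np.array([[sym_kl(a, b) for b in phi_b] for a in phi_a])
    matches, used = [], set()
    for i in np.argsort(D.min(axis=1)):                 # best-matched topics first
        j = min((j for j in range(len(phi_b)) if j not in used), key=lambda j: D[i, j])
        matches.append((int(i), int(j), D[i, j]))
        used.add(j)
    return matches                                      # (topic in A, topic in B, KL)
```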
Detecting Unusual Papers by Authors
For any paper by an author, we can calculate how surprising its words are given that author’s topic mixture: some papers by an author are on topics unusual for that author
[Tables: papers ranked by unusualness (perplexity) for C. Faloutsos and for M. Jordan]
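The “unusualness” score can be computed as the perplexity of a paper's words under the author's own topic mixture. A minimal sketch, assuming theta_a is the author's distribution over topics and phi is the topic-word matrix:

```python
import numpy as np

def author_perplexity(word_ids, theta_a, phi):
    """Perplexity of a document under one author's topic mixture.
    p(w | author) = sum_t theta_a[t] * phi[t, w]; higher perplexity = more surprising."""
    p_w = theta_a @ phi                      # (V,) word distribution implied by the author
    log_lik = np.sum(np.log(p_w[np.asarray(word_ids)]))
    return float(np.exp(-log_lik / len(word_ids)))
```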
Author Separation
Can the model attribute words to authors correctly within a document? Test of model:
1) artificially combine abstracts from different authors
2) check whether each word is assigned to its correct original author
(The superscript numbers show the author each word was assigned to: 1 = Scholkopf_B, 2 = Darwiche_A)
Written by (1) Scholkopf_B: A method 1 is described which like the kernel 1 trick 1 in support 1 vector 1 machines 1 SVMs 1 lets us generalize distance 1 based 2 algorithms to operate in feature 1 spaces usually nonlinearly related to the input 1 space This is done by identifying a class of kernels 1 which can be represented as norm 1 based 2 distances 1 in Hilbert spaces It turns 1 out that common kernel 1 algorithms such as SVMs 1 and kernel 1 PCA 1 are actually really distance 1 based 2 algorithms and can be run 2 with that class of kernels 1 too As well as providing 1 a useful new insight 1 into how these algorithms work the present 2 work can form the basis 1 for conceiving new algorithms
Written by (2) Darwiche_A: This paper presents 2 a comprehensive approach for model 2 based 2 diagnosis 2 which includes proposals for characterizing and computing 2 preferred 2 diagnoses 2 assuming that the system 2 description 2 is augmented with a system 2 structure 2 a directed 2 graph 2 explicating the interconnections between system 2 components 2 Specifically we first introduce the notion of a consequence 2 which is a syntactically 2 unconstrained propositional 2 sentence 2 that characterizes all consistency 2 based 2 diagnoses 2 and show 2 that standard 2 characterizations of diagnoses 2 such as minimal conflicts 1 correspond to syntactic 2 variations 1 on a consequence 2 Second we propose a new syntactic 2 variation on the consequence 2 known as negation 2 normal form NNF and discuss its merits compared to standard variations Third we introduce a basic algorithm 2 for computing consequences in NNF given a structured system 2 description We show that if the system 2 structure 2 does not contain cycles 2 then there is always a linear size 2 consequence 2 in NNF which can be computed in linear time 2 For arbitrary 1 system 2 structures 2 we show a precise connection between the complexity 2 of computing 2 consequences and the topology of the underlying system 2 structure 2 Finally we present 2 an algorithm 2 that enumerates 2 the preferred 2 diagnoses 2 characterized by a consequence 2 The algorithm 2 is shown 1 to take linear time 2 in the size 2 of the consequence 2 if the preference criterion 1 satisfies some general conditions
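Scoring this test is straightforward once the model has assigned an author to every token of the merged abstract; a hypothetical helper (names are illustrative, not from the slides):

```python
def author_separation_accuracy(true_source, assigned_author):
    """Fraction of tokens attributed to the abstract they actually came from.
    true_source[i] is 1 or 2 (which original abstract token i came from);
    assigned_author[i] is the model's sampled author index for that token."""
    hits = sum(int(t == a) for t, a in zip(true_source, assigned_author))
    return hits / len(true_source)
```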
Applications of Author-Topic Models
“Expert Finder”
– “Find researchers who are knowledgeable in cryptography and machine learning within 100 miles of Washington DC”
– “Find reviewers for this set of NSF proposals who are active in relevant topics and have no conflicts of interest”
Prediction
– Given a document and some subset of known authors for the paper (k = 0, 1, 2, …), predict the other authors
– Predict how many papers in different topics will appear next year
Change Detection/Monitoring
– Which authors are on the leading edge of new topics?
– Characterize the “topic trajectory” of an author over time
[Figures: topic trends over time]
Rise in Web, Mobile, JAVA
Rise of Machine Learning
Bayes lives on….
Decline in Languages, OS, …
Decline in CS Theory, …
Trends in Database Research
Trends in NLP and IR
Security Research Reborn…
(Not so) Hot Topics: Neural Networks, GAs, Wavelets
Decline in use of Greek Letters
Future Work
Theory development
– incorporate citation information, collaboration networks
– other document types, e.g., handling subject lines, threads, and “to” and “cc” fields
New datasets:
– Enron corpus
– Web pages
– PubMed abstracts (possibly)
New applications of author-topic models
Black box for text document collection summarization
– automatically extract a summary of relevant topics and author patterns for a large data set such as Enron
“Expert Finder”
– “Find researchers who are knowledgeable in cryptography and machine learning within 100 miles of Washington DC”
– “Find reviewers for this set of NSF proposals who are active in relevant topics and have no conflicts of interest”
Change Detection/Monitoring
– Which authors are on the leading edge of new topics?
– Characterize the “topic trajectory” of an author over time
Prediction (work in progress)
– Given a document and some subset of known authors for the paper (k = 0, 1, 2, …), predict the other authors
– Predict how many papers in different topics will appear next year
The Author-Topic Browser
[Figure: (a) querying on author Pazzani_M, (b) querying on a topic relevant to the author, (c) querying on a document written by the author]
Scientific syntax and semantics
Factorization of language based on statistical dependency patterns:
– semantics: probabilistic topics (long-range, document-specific dependencies)
– syntax: probabilistic regular grammar (short-range dependencies, constant across all documents)
[Graphical model: each word w has a syntactic class x generated by an HMM; one of the classes emits words from a document-specific topic z]
Generating a sentence from the composite syntax/semantics model
Topic z = 1: HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2
Topic z = 2: SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2
Syntactic class: THE 0.6, A 0.3, MANY 0.1
Syntactic class: OF 0.6, FOR 0.3, BETWEEN 0.1
Semantic class: emits a word from one of the topics
Generation proceeds word by word: THE → THE LOVE → THE LOVE OF → THE LOVE OF RESEARCH → …
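A toy sampler for this composite model, using the emission distributions shown above. The class-transition matrix, starting class, and the document's topic weights are invented for illustration; only the emission distributions come from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
topics = [(["HEART", "LOVE", "SOUL", "TEARS", "JOY"], [0.2] * 5),
          (["SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"], [0.2] * 5)]
classes = {2: (["THE", "A", "MANY"], [0.6, 0.3, 0.1]),
           3: (["OF", "FOR", "BETWEEN"], [0.6, 0.3, 0.1])}
theta = [0.5, 0.5]                              # this document's topic weights (assumed)
trans = np.array([[0.1, 0.5, 0.4],              # assumed HMM transitions over classes 1..3
                  [0.8, 0.1, 0.1],              # (class 1 = the semantic/topic class)
                  [0.7, 0.2, 0.1]])

def generate(n_words=5, x=2):
    words = []
    for _ in range(n_words):
        if x == 1:                              # semantic class: word comes from a topic
            vocab, probs = topics[rng.choice(2, p=theta)]
        else:                                   # syntactic class: word from its own list
            vocab, probs = classes[x]
        words.append(vocab[rng.choice(len(vocab), p=probs)])
        x = int(rng.choice([1, 2, 3], p=trans[x - 1]))   # HMM transition to next class
    return " ".join(words)

print(generate())                               # e.g. a sentence like "THE LOVE OF RESEARCH ..."
```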
Semantic topics
Syntactic classes (one class per line)
– IN FOR ON BETWEEN DURING AMONG FROM UNDER WITHIN THROUGHOUT THROUGH TOWARD INTO AT INVOLVING AFTER ACROSS AGAINST WHEN ALONG
– ARE WERE WAS IS WHEN REMAIN REMAINS REMAINED PREVIOUSLY BECOME BECAME BEING BUT GIVE MERE APPEARED APPEAR ALLOWED NORMALLY EACH
– THE THIS ITS THEIR AN EACH ONE ANY INCREASED EXOGENOUS OUR RECOMBINANT ENDOGENOUS TOTAL PURIFIED TILE FULL CHRONIC ANOTHER EXCESS
– SUGGEST INDICATE SUGGESTING SUGGESTS SHOWED REVEALED SHOW DEMONSTRATE INDICATING PROVIDE SUPPORT INDICATES PROVIDES INDICATED DEMONSTRATED SHOWS SO REVEAL DEMONSTRATES SUGGESTED
– LEVELS NUMBER LEVEL RATE TIME CONCENTRATIONS VARIETY RANGE CONCENTRATION DOSE FAMILY SET FREQUENCY SERIES AMOUNTS RATES CLASS VALUES AMOUNT SITES
– RESULTS ANALYSIS DATA STUDIES STUDY FINDINGS EXPERIMENTS OBSERVATIONS HYPOTHESIS ANALYSES ASSAYS POSSIBILITY MICROSCOPY PAPER WORK EVIDENCE FINDING MUTAGENESIS OBSERVATION MEASUREMENTS
– BEEN MAY CAN COULD WELL DID DOES DO MIGHT SHOULD WILL WOULD MUST CANNOT THEY ALSO BECOME MAG LIKELY