ICS 278: Data Mining
Lecture 14: Document Clustering and Topic Extraction
Padhraic Smyth
Department of Information and Computer Science, University of California, Irvine
Text Mining
– Information Retrieval
– Text Classification
– Text Clustering
– Information Extraction
Document Clustering
Set of documents D in term-vector form
– no class labels this time
– want to group the documents into K groups or into a taxonomy
– each cluster hypothetically corresponds to a “topic”
Methods:
– any of the well-known clustering methods
– k-means, e.g., “spherical k-means”: normalize document vectors so distances are cosine-like (see the sketch below)
– hierarchical clustering
– probabilistic model-based clustering, e.g., mixtures of multinomials
Single-topic versus multiple-topic models
– extensions to author-topic models
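The “spherical k-means” bullet above can be made concrete in a few lines of scikit-learn: TF-IDF document vectors are normalized to unit length, so ordinary k-means effectively clusters by cosine similarity. This is a minimal sketch under assumed choices (toy corpus, K = 2, English stop-word removal), not code from the lecture.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

docs = ["the hard drive failed again",          # toy corpus (assumption)
        "install the scsi card and format the disk",
        "faith, the bible, and christian belief",
        "god and religious claims"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
X = normalize(X)                                # unit-length rows: "spherical" k-means
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)                                   # cluster index for each document
```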
Mixture Model Clustering
Conditional independence model for each component (often quite useful to first order)
Mixtures of Documents
[Figure: document-term matrix with rows grouped into Component 1 and Component 2]
[Figure: document-term matrix; the component labels C1, C2 are unobserved and treated as missing]
E-Step: estimate component membership probabilities P(C1 | x), P(C2 | x) for each document, given current parameter estimates (the missing component labels)
M-Step: use the “fractionally” weighted data to get new estimates of the parameters
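A compact sketch of these E and M steps for a mixture of multinomials, with the documents given as a word-count matrix. The initialization, smoothing constant, and iteration count are illustrative assumptions.

```python
import numpy as np

def em_multinomial_mixture(X, K, n_iter=50, alpha=1e-2, seed=0):
    """EM for a K-component mixture of multinomials; X is (n_docs, n_terms) counts."""
    rng = np.random.default_rng(seed)
    n, V = X.shape
    pi = np.full(K, 1.0 / K)                      # mixture weights
    theta = rng.dirichlet(np.ones(V), size=K)     # p(term | component), shape (K, V)
    for _ in range(n_iter):
        # E-step: membership probabilities from current parameters
        log_r = np.log(pi) + X @ np.log(theta).T  # log p(c) + log p(doc | c), shape (n, K)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the fractionally weighted data
        pi = r.mean(axis=0)
        theta = r.T @ X + alpha                   # expected counts + smoothing
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta, r
```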
A Document Cluster
Most likely terms in Component 5 (weight = 0.08): write, drive, problem, mail, articl, hard, work, system, good, time
Highest-lift terms in Component 5 (lift = p(t|k) / p(t)): scsi, drive, hard, card, format, softwar, memori, install, disk, engin
(probability and lift values not shown)
Another Document Cluster
Most likely terms in Component 1 (weight = 0.11): articl, good, dai, fact, god, claim, apr, fbi, christian, group
Highest-lift terms in Component 1: fbi, jesu, fire, christian, evid, god, gun, faith, kill, bibl
(probability and lift values not shown)
A topic is represented as a (multinomial) distribution over words
Example topic #1: SPEECH 0.0691, RECOGNITION 0.0412, SPEAKER 0.0288, PHONEME 0.0224, CLASSIFICATION 0.0154, SPEAKERS 0.0140, FRAME 0.0135, PHONETIC 0.0119, PERFORMANCE 0.0111, ACOUSTIC 0.0099, BASED 0.0098, PHONEMES 0.0091, UTTERANCES 0.0091, SET 0.0089, LETTER 0.0088, …
Example topic #2: WORDS 0.0671, WORD 0.0557, USER 0.0230, DOCUMENTS 0.0205, TEXT 0.0195, RETRIEVAL 0.0152, INFORMATION 0.0144, DOCUMENT 0.0144, LARGE 0.0102, COLLECTION 0.0098, KNOWLEDGE 0.0087, MACHINE 0.0080, RELEVANT 0.0077, SEMANTIC 0.0076, SIMILARITY 0.0071, …
The basic model….
[Graphical model: a single cluster variable C with conditionally independent word variables X1, X2, …, Xd]
A better model….
[Graphical model: word variables X1, X2, …, Xd can each depend on several latent components A, B, C]
A better model….
[Same graphical model as above]
History:
– latent class models in statistics
– Hofmann applied the idea to documents (SIGIR ’99)
– recent extensions, e.g., Blei, Ng, Jordan (JMLR, 2003)
– variously known as factor / aspect / latent class models
A better model….
[Same graphical model as above]
Inference can be intractable due to undirected loops!
A better model for documents….
Multi-topic model
– a document is generated from multiple components
– multiple components can be active at once
– each component = a multinomial distribution
– parameter estimation is tricky
– very useful: “parses” documents into high-level semantic components
A generative model for documents
Topic 1, p(w | z = 1): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2 (all other words 0.0)
Topic 2, p(w | z = 2): SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2 (all other words 0.0)
Choose mixture weights θ = {P(z = 1), P(z = 2)} for each document, then generate a “bag of words”
Example weights: {0, 1}, {0.25, 0.75}, {0.5, 0.5}, {0.75, 0.25}, {1, 0}
[Figure: the five generated documents shift from all topic-2 words (MATHEMATICS KNOWLEDGE RESEARCH WORK SCIENTIFIC …) through mixed documents (MATHEMATICS HEART RESEARCH LOVE WORK TEARS SOUL KNOWLEDGE …) to all topic-1 words (SOUL TEARS LOVE JOY …)]
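A small simulation of this generative step, using the two five-word topics from the previous slide; the document length and random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["HEART", "LOVE", "SOUL", "TEARS", "JOY",
         "SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]
phi = np.array([[0.2] * 5 + [0.0] * 5,      # p(w | z = 1): "love" topic
                [0.0] * 5 + [0.2] * 5])     # p(w | z = 2): "science" topic

def generate_doc(theta, length=10):
    """For each token, draw a topic z ~ theta, then a word w ~ p(w | z)."""
    z = rng.choice(2, size=length, p=theta)
    return " ".join(vocab[rng.choice(len(vocab), p=phi[zi])] for zi in z)

for theta in ([0, 1], [0.25, 0.75], [0.5, 0.5], [0.75, 0.25], [1, 0]):
    print(theta, generate_doc(np.array(theta, dtype=float)))
```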
A visual example: Bars
pixel = word, image = document; sample each pixel from a mixture of topics
Interpretable decomposition
SVD gives a basis for the data, but not an interpretable one
The true basis is not orthogonal, so rotation does no good
[Figure: matrix-factorization view (Dumais, Landauer). SVD/LSA: the words × documents matrix ≈ U D V, with documents mapped onto latent “dims”. LDA: the word-document probabilities P(w) are expressed as a mixture over topics, i.e., a words × topics matrix P(w|z) times a topics × documents matrix P(z)]
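The contrast can be seen directly in scikit-learn: TruncatedSVD returns real-valued (possibly negative) component vectors, while LDA components can be normalized into proper word distributions p(w|z). The toy corpus and the choice of two components are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

docs = ["speech recognition phoneme acoustic speaker",
        "document retrieval text information relevant",
        "speaker acoustic phonetic utterance speech",
        "text document collection retrieval information"]
X = CountVectorizer().fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0).fit(X)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

p_w_given_z = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
print(svd.components_[0])    # an orthogonal "dimension": entries can be negative
print(p_w_given_z[0])        # a topic: a multinomial distribution over the vocabulary
```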
History of multi-topic models
Latent class models in statistics
Hofmann 1999
– original application to documents
Blei, Ng, and Jordan (2001, 2003)
– variational methods
Griffiths and Steyvers (2003)
– Gibbs sampling approach (very efficient)
A selection of topics (top words per topic)
– FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
– HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
– MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
– STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
– NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
– TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN
A selection of topics (top words per topic)
– PARASITE PARASITES FALCIPARUM MALARIA HOST PLASMODIUM ERYTHROCYTES ERYTHROCYTE MAJOR LEISHMANIA INFECTED BLOOD INFECTION MOSQUITO INVASION TRYPANOSOMA CRUZI BRUCEI HUMAN HOSTS
– ADULT DEVELOPMENT FETAL DAY DEVELOPMENTAL POSTNATAL EARLY DAYS NEONATAL LIFE DEVELOPING EMBRYONIC BIRTH NEWBORN MATERNAL PRESENT PERIOD ANIMALS NEUROGENESIS ADULTS
– CHROMOSOME REGION CHROMOSOMES KB MAP MAPPING CHROMOSOMAL HYBRIDIZATION ARTIFICIAL MAPPED PHYSICAL MAPS GENOMIC DNA LOCUS GENOME GENE HUMAN SITU CLONES
– MALE FEMALE MALES FEMALES SEX SEXUAL BEHAVIOR OFFSPRING REPRODUCTIVE MATING SOCIAL SPECIES REPRODUCTION FERTILITY TESTIS MATE GENETIC GERM CHOICE SRY
– STUDIES PREVIOUS SHOWN RESULTS RECENT PRESENT STUDY DEMONSTRATED INDICATE WORK SUGGEST SUGGESTED USING FINDINGS DEMONSTRATE REPORT INDICATED CONSISTENT REPORTS CONTRAST
– MECHANISM MECHANISMS UNDERSTOOD POORLY ACTION UNKNOWN REMAIN UNDERLYING MOLECULAR PS REMAINS SHOW RESPONSIBLE PROCESS SUGGEST UNCLEAR REPORT LEADING LARGELY KNOWN
– MODEL MODELS EXPERIMENTAL BASED PROPOSED DATA SIMPLE DYNAMICS PREDICTED EXPLAIN BEHAVIOR THEORETICAL ACCOUNT THEORY PREDICTS COMPUTER QUANTITATIVE PREDICTIONS CONSISTENT PARAMETERS
“Content” components versus “Boilerplate” components
Content topics:
– GROUP MULTICAST INTERNET PROTOCOL RELIABLE GROUPS PROTOCOLS IP TRANSPORT DRAFT
– DYNAMIC STRUCTURE STRUCTURES STATIC PAPER DYNAMICALLY PRESENT META CALLED RECURSIVE
– DISTRIBUTED COMPUTING SYSTEMS SYSTEM HETEROGENEOUS ENVIRONMENT PAPER SUPPORT ARCHITECTURE ENVIRONMENTS
Boilerplate topic:
– RESEARCH SUPPORTED PART GRANT SCIENCE FOUNDATION FL WORK NATIONAL NSF
More example topics (top words per topic)
– DIMENSIONAL POINTS SURFACE GEOMETRIC SURFACES MESH PLANE POINT GEOMETRY PLANAR
– RULES CLASSIFICATION RULE ACCURACY ATTRIBUTES INDUCTION CLASSIFIER SET ATTRIBUTE CLASSIFIERS
– ORDER TERMS PARTIAL HIGHER REDUCTION PAPER TERM ORDERING SHOW MAGNITUDE
– GRAPH PATH GRAPHS PATHS EDGE NUMBER CONNECTED DIRECTED NODES VERTICES
– INFORMATION TEXT RETRIEVAL SOURCES DOCUMENT DOCUMENTS RELEVANT CONTENT AUTOMATICALLY DIGITAL
– SYSTEM FILE OPERATING STORAGE DISK SYSTEMS KERNEL ACCESS MANAGEMENT UNIX
– PAPER CONDITIONS CONCEPT CONCEPTS DISCUSSED DEFINITION ISSUES PROPERTIES IMPORTANT EXAMPLES
– LANGUAGE PROGRAMMING LANGUAGES FUNCTIONAL SEMANTICS SEMANTIC NATURAL CONSTRUCTS GRAMMAR LISP
“Style” components
– MODEL MODELS MODELING QUALITATIVE COMPLEX QUANTITATIVE CAPTURE MODELED ACCURATELY REALISTIC
– PAPER APPROACHES PROPOSED CHANGE BELIEF ALTERNATIVE APPROACH ORIGINAL SHOW PROPOSE
– TYPE SPECIFICATION TYPES FORMAL VERIFICATION SPECIFICATIONS CHECKING SYSTEM PROPERTIES ABSTRACT
– KNOWLEDGE SYSTEM SYSTEMS BASE EXPERT ACQUISITION DOMAIN INTELLIGENT BASES BASED
Recent Results on Author-Topic Models
Can we model authors, given documents?
(more generally: build statistical profiles of entities given sparse observed data)
[Figure: authors linked to the words of their documents]
[Figure: authors linked to words through hidden topics]
Model = Author-Topic distributions + Topic-Word distributions
Parameters learned via Bayesian learning
“Topic Model”:
– a document can be generated from multiple topics
– Hofmann (SIGIR ’99); Blei, Ng, Jordan (JMLR, 2003)
[Figure: words linked to hidden topics]
[Figure: authors linked to words through hidden topics]
Model = Author-Topic distributions + Topic-Word distributions
NOTE: documents can be composed of multiple topics
The Author-Topic Model: Assumptions of the Generative Model
– Each author is associated with a topic mixture
– Each document is a mixture of topics
– With multiple authors, the document is a mixture of the coauthors’ topic mixtures
– Each word in a text is generated from one topic and one author (potentially different for each word)
Generative Process
Assume authors A1 and A2 collaborate and produce a paper
– A1 has its own multinomial topic distribution
– A2 has its own multinomial topic distribution
For each word in the paper:
1. Sample an author x (uniformly) from {A1, A2}
2. Sample a topic z from author x’s topic distribution
3. Sample a word w from the multinomial word distribution of topic z
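A direct simulation of these three steps. The vocabulary size, number of topics, and the theta/phi notation and values below are placeholders invented for illustration, not estimates from data.

```python
import numpy as np

rng = np.random.default_rng(0)
V, T = 8, 3                                     # toy vocabulary size and topic count
phi = rng.dirichlet(np.ones(V), size=T)         # topic-word distributions phi_z
theta = {"A1": rng.dirichlet(np.ones(T)),       # each author's topic distribution theta
         "A2": rng.dirichlet(np.ones(T))}

def generate_paper(coauthors, n_words=15):
    words = []
    for _ in range(n_words):
        x = rng.choice(coauthors)               # 1. sample an author uniformly
        z = rng.choice(T, p=theta[x])           # 2. sample a topic from theta_x
        w = rng.choice(V, p=phi[z])             # 3. sample a word from phi_z
        words.append(int(w))
    return words

print(generate_paper(["A1", "A2"]))             # word ids of the generated paper
```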
Graphical Model
1. Choose an author (from the set of co-authors)
2. Choose a topic
3. Choose a word
[Figure: plate diagram of the author-topic model]
Model Estimation
Estimate x and z by Gibbs sampling (assignments of each word to an author and a topic); see the sketch below
Estimation is efficient: linear in data size
Infer:
– Author-Topic distributions
– Topic-Word distributions
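For concreteness, a sketch of one collapsed Gibbs sweep in the spirit of this estimation step: each word's (author, topic) pair is resampled from its conditional distribution given all other assignments. The count-matrix layout, hyperparameters alpha and beta, and data structures are assumptions; this is not the implementation used for the results that follow.

```python
import numpy as np

def gibbs_sweep(docs, authors, z, x, CWT, CAT, alpha, beta, rng):
    """One collapsed Gibbs sweep for an author-topic model (sketch).
    docs[d]: word ids; authors[d]: author ids of doc d's co-authors;
    z[d][i], x[d][i]: current topic/author assignment of token i in doc d;
    CWT[w, t]: count of word w assigned to topic t; CAT[a, t]: author-topic counts."""
    V, T = CWT.shape
    for d, words in enumerate(docs):
        A = np.asarray(authors[d])
        for i, w in enumerate(words):
            t, a = z[d][i], x[d][i]
            CWT[w, t] -= 1; CAT[a, t] -= 1                 # remove current assignment
            p_w = (CWT[w] + beta) / (CWT.sum(axis=0) + V * beta)                      # (T,)
            p_a = (CAT[A] + alpha) / (CAT[A].sum(axis=1, keepdims=True) + T * alpha)  # (|A|, T)
            p = (p_a * p_w).ravel()                        # joint over (author, topic)
            k = rng.choice(p.size, p=p / p.sum())
            a, t = A[k // T], k % T
            x[d][i], z[d][i] = a, t
            CWT[w, t] += 1; CAT[a, t] += 1                 # record new assignment
    return z, x
```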
Data
– 1700 proceedings papers from NIPS (2000+ authors); NIPS = Neural Information Processing Systems
– 160,000 CiteSeer abstracts (85,000+ authors)
– Removed stop words; word order is irrelevant, just use word counts
Processing time:
– NIPS: 2000 Gibbs iterations, 12 hours on a PC workstation
– CiteSeer: 700 Gibbs iterations, 111 hours
Author Modeling Data Sets (Source / Documents / Unique Authors / Unique Words / Total Word Count)
CiteSeer: 163,389 / 85,465 / 30,… / … million
CORA: 13,643 / 11,427 / 11,… / … million
NIPS: 1,740 / 2,037 / 13,… / … million
Four example topics from CiteSeer (T=300)
Four more topics
Some likely topics per author (CiteSeer)
Author = Andrew McCallum, U Mass:
– Topic 1: classification, training, generalization, decision, data, …
– Topic 2: learning, machine, examples, reinforcement, inductive, …
– Topic 3: retrieval, text, document, information, content, …
Author = Hector Garcia-Molina, Stanford:
– Topic 1: query, index, data, join, processing, aggregate, …
– Topic 2: transaction, concurrency, copy, permission, distributed, …
– Topic 3: source, separation, paper, heterogeneous, merging, …
Author = Paul Cohen, USC/ISI:
– Topic 1: agent, multi, coordination, autonomous, intelligent, …
– Topic 2: planning, action, goal, world, execution, situation, …
– Topic 3: human, interaction, people, cognitive, social, natural, …
Four example topics from NIPS (T=100)
Four more topics
Stability of Topics
The indexing of topics is arbitrary across runs of the model (e.g., topic #1 is not the same topic across runs)
However:
– the majority of topics are stable over processing time
– the majority of topics can be aligned across runs
Topics appear to represent genuine structure in the data
Comparing NIPS topics from the same chain (t1 = 1000 and t2 = 2000)
[Figure: KL distance between topics at t1 = 1000 and re-ordered topics at t2 = 2000; best match KL = 0.54, worst match KL = 4.78]
Comparing NIPS topics and CiteSeer topics
[Figure: KL distance between NIPS topics and re-ordered CiteSeer topics; example matches at KL = 2.88, 4.48, 4.92, 5.0]
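A sketch of how such a comparison can be computed: a symmetrized KL distance between every pair of topic-word distributions, followed by a greedy matching to re-order the topics of the second run. The greedy matching is an illustrative stand-in; the exact matching procedure behind these figures is not specified on the slides.

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL distance between two topic-word distributions."""
    p, q = p + eps, q + eps
    return 0.5 * float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def align_topics(phi_a, phi_b):
    """Greedily match each topic of run A to its closest unused topic of run B."""
    D = np.array([[sym_kl(a, b) for b in phi_b] for a in phi_a])
    matches, used = [], set()
    for i in np.argsort(D.min(axis=1)):                 # best-matched topics first
        j = min((j for j in range(len(phi_b)) if j not in used), key=lambda j: D[i, j])
        matches.append((int(i), int(j), D[i, j]))
        used.add(j)
    return matches                                      # (topic in A, topic in B, KL)
```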
Detecting Unusual Papers by Authors
For any paper by an author, we can calculate how surprising its words are given that author’s topic mixture: some papers by an author are on topics unusual for that author
[Tables: papers ranked by unusualness (perplexity) for C. Faloutsos and for M. Jordan]
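The “unusualness” score can be computed as the perplexity of a paper's words under the author's own topic mixture. A minimal sketch, assuming theta_a is the author's distribution over topics and phi is the topic-word matrix:

```python
import numpy as np

def author_perplexity(word_ids, theta_a, phi):
    """Perplexity of a document under one author's topic mixture.
    p(w | author) = sum_t theta_a[t] * phi[t, w]; higher perplexity = more surprising."""
    p_w = theta_a @ phi                      # (V,) word distribution implied by the author
    log_lik = np.sum(np.log(p_w[np.asarray(word_ids)]))
    return float(np.exp(-log_lik / len(word_ids)))
```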
Author Separation
Can the model attribute words to authors correctly within a document? Test of model:
1) artificially combine abstracts from different authors
2) check whether each word is assigned to its correct original author
(The superscript numbers show the author each word was assigned to: 1 = Scholkopf_B, 2 = Darwiche_A)
Written by (1) Scholkopf_B: A method 1 is described which like the kernel 1 trick 1 in support 1 vector 1 machines 1 SVMs 1 lets us generalize distance 1 based 2 algorithms to operate in feature 1 spaces usually nonlinearly related to the input 1 space This is done by identifying a class of kernels 1 which can be represented as norm 1 based 2 distances 1 in Hilbert spaces It turns 1 out that common kernel 1 algorithms such as SVMs 1 and kernel 1 PCA 1 are actually really distance 1 based 2 algorithms and can be run 2 with that class of kernels 1 too As well as providing 1 a useful new insight 1 into how these algorithms work the present 2 work can form the basis 1 for conceiving new algorithms
Written by (2) Darwiche_A: This paper presents 2 a comprehensive approach for model 2 based 2 diagnosis 2 which includes proposals for characterizing and computing 2 preferred 2 diagnoses 2 assuming that the system 2 description 2 is augmented with a system 2 structure 2 a directed 2 graph 2 explicating the interconnections between system 2 components 2 Specifically we first introduce the notion of a consequence 2 which is a syntactically 2 unconstrained propositional 2 sentence 2 that characterizes all consistency 2 based 2 diagnoses 2 and show 2 that standard 2 characterizations of diagnoses 2 such as minimal conflicts 1 correspond to syntactic 2 variations 1 on a consequence 2 Second we propose a new syntactic 2 variation on the consequence 2 known as negation 2 normal form NNF and discuss its merits compared to standard variations Third we introduce a basic algorithm 2 for computing consequences in NNF given a structured system 2 description We show that if the system 2 structure 2 does not contain cycles 2 then there is always a linear size 2 consequence 2 in NNF which can be computed in linear time 2 For arbitrary 1 system 2 structures 2 we show a precise connection between the complexity 2 of computing 2 consequences and the topology of the underlying system 2 structure 2 Finally we present 2 an algorithm 2 that enumerates 2 the preferred 2 diagnoses 2 characterized by a consequence 2 The algorithm 2 is shown 1 to take linear time 2 in the size 2 of the consequence 2 if the preference criterion 1 satisfies some general conditions
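Scoring this test is straightforward once the model has assigned an author to every token of the merged abstract; a hypothetical helper (names are illustrative, not from the slides):

```python
def author_separation_accuracy(true_source, assigned_author):
    """Fraction of tokens attributed to the abstract they actually came from.
    true_source[i] is 1 or 2 (which original abstract token i came from);
    assigned_author[i] is the model's sampled author index for that token."""
    hits = sum(int(t == a) for t, a in zip(true_source, assigned_author))
    return hits / len(true_source)
```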
Applications of Author-Topic Models
“Expert Finder”
– “Find researchers who are knowledgeable in cryptography and machine learning within 100 miles of Washington DC”
– “Find reviewers for this set of NSF proposals who are active in relevant topics and have no conflicts of interest”
Prediction
– Given a document and some subset of known authors for the paper (k = 0, 1, 2, …), predict the other authors
– Predict how many papers in different topics will appear next year
Change Detection/Monitoring
– Which authors are on the leading edge of new topics?
– Characterize the “topic trajectory” of an author over time
[Figures: topic trends over time]
Rise in Web, Mobile, JAVA
Rise of Machine Learning
Bayes lives on….
Decline in Languages, OS, …
Decline in CS Theory, …
Trends in Database Research
Trends in NLP and IR
Security Research Reborn…
(Not so) Hot Topics: Neural Networks, GAs, Wavelets
Decline in use of Greek Letters
Future Work
Theory development
– incorporate citation information, collaboration networks
– other document types, e.g., handling subject lines, threads, and “to” and “cc” fields
New datasets:
– Enron corpus
– Web pages
– PubMed abstracts (possibly)
New applications of author-topic models
Black box for text document collection summarization
– automatically extract a summary of relevant topics and author patterns for a large data set such as Enron
“Expert Finder”
– “Find researchers who are knowledgeable in cryptography and machine learning within 100 miles of Washington DC”
– “Find reviewers for this set of NSF proposals who are active in relevant topics and have no conflicts of interest”
Change Detection/Monitoring
– Which authors are on the leading edge of new topics?
– Characterize the “topic trajectory” of an author over time
Prediction (work in progress)
– Given a document and some subset of known authors for the paper (k = 0, 1, 2, …), predict the other authors
– Predict how many papers in different topics will appear next year
The Author-Topic Browser
[Figure: (a) querying on author Pazzani_M, (b) querying on a topic relevant to the author, (c) querying on a document written by the author]
Scientific syntax and semantics
Factorization of language based on statistical dependency patterns:
– semantics: probabilistic topics (long-range, document-specific dependencies)
– syntax: probabilistic regular grammar (short-range dependencies, constant across all documents)
[Graphical model: each word w has a syntactic class x generated by an HMM; one of the classes emits words from a document-specific topic z]
Generating a sentence from the composite syntax/semantics model
Topic z = 1: HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2
Topic z = 2: SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2
Syntactic class: THE 0.6, A 0.3, MANY 0.1
Syntactic class: OF 0.6, FOR 0.3, BETWEEN 0.1
Semantic class: emits a word from one of the topics
Generation proceeds word by word: THE → THE LOVE → THE LOVE OF → THE LOVE OF RESEARCH → …
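A toy sampler for this composite model, using the emission distributions shown above. The class-transition matrix, starting class, and the document's topic weights are invented for illustration; only the emission distributions come from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
topics = [(["HEART", "LOVE", "SOUL", "TEARS", "JOY"], [0.2] * 5),
          (["SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"], [0.2] * 5)]
classes = {2: (["THE", "A", "MANY"], [0.6, 0.3, 0.1]),
           3: (["OF", "FOR", "BETWEEN"], [0.6, 0.3, 0.1])}
theta = [0.5, 0.5]                              # this document's topic weights (assumed)
trans = np.array([[0.1, 0.5, 0.4],              # assumed HMM transitions over classes 1..3
                  [0.8, 0.1, 0.1],              # (class 1 = the semantic/topic class)
                  [0.7, 0.2, 0.1]])

def generate(n_words=5, x=2):
    words = []
    for _ in range(n_words):
        if x == 1:                              # semantic class: word comes from a topic
            vocab, probs = topics[rng.choice(2, p=theta)]
        else:                                   # syntactic class: word from its own list
            vocab, probs = classes[x]
        words.append(vocab[rng.choice(len(vocab), p=probs)])
        x = int(rng.choice([1, 2, 3], p=trans[x - 1]))   # HMM transition to next class
    return " ".join(words)

print(generate())                               # e.g. a sentence like "THE LOVE OF RESEARCH ..."
```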
Semantic topics
Syntactic classes (one class per line)
– IN FOR ON BETWEEN DURING AMONG FROM UNDER WITHIN THROUGHOUT THROUGH TOWARD INTO AT INVOLVING AFTER ACROSS AGAINST WHEN ALONG
– ARE WERE WAS IS WHEN REMAIN REMAINS REMAINED PREVIOUSLY BECOME BECAME BEING BUT GIVE MERE APPEARED APPEAR ALLOWED NORMALLY EACH
– THE THIS ITS THEIR AN EACH ONE ANY INCREASED EXOGENOUS OUR RECOMBINANT ENDOGENOUS TOTAL PURIFIED TILE FULL CHRONIC ANOTHER EXCESS
– SUGGEST INDICATE SUGGESTING SUGGESTS SHOWED REVEALED SHOW DEMONSTRATE INDICATING PROVIDE SUPPORT INDICATES PROVIDES INDICATED DEMONSTRATED SHOWS SO REVEAL DEMONSTRATES SUGGESTED
– LEVELS NUMBER LEVEL RATE TIME CONCENTRATIONS VARIETY RANGE CONCENTRATION DOSE FAMILY SET FREQUENCY SERIES AMOUNTS RATES CLASS VALUES AMOUNT SITES
– RESULTS ANALYSIS DATA STUDIES STUDY FINDINGS EXPERIMENTS OBSERVATIONS HYPOTHESIS ANALYSES ASSAYS POSSIBILITY MICROSCOPY PAPER WORK EVIDENCE FINDING MUTAGENESIS OBSERVATION MEASUREMENTS
– BEEN MAY CAN COULD WELL DID DOES DO MIGHT SHOULD WILL WOULD MUST CANNOT THEY ALSO BECOME MAG LIKELY