1 Data Mining Lectures, Lecture 12: Text Mining, Padhraic Smyth, UC Irvine. ICS 278: Data Mining, Lecture 14: Document Clustering and Topic Extraction. Padhraic Smyth, Department of Information and Computer Science, University of California, Irvine

2 Text Mining: Information Retrieval, Text Classification, Text Clustering, Information Extraction

3 Document Clustering
Set of documents D in term-vector form
–no class labels this time
–want to group the documents into K groups or into a taxonomy
–each cluster hypothetically corresponds to a “topic”
Methods:
–any of the well-known clustering methods
–k-means, e.g., “spherical k-means” (normalize document vectors so distances are effectively cosine distances)
–hierarchical clustering
–probabilistic model-based clustering methods, e.g., mixtures of multinomials
Single-topic versus multiple-topic models
–extensions to author-topic models
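The “spherical k-means” variant mentioned above clusters unit-normalized document vectors by cosine similarity. A minimal sketch (not the lecture's code; initializing centers from the first K documents is a simplification, a real implementation would use random restarts):

```python
import numpy as np

def spherical_kmeans(X, K, n_iter=20):
    """Spherical k-means sketch: normalize document vectors to unit
    length and cluster by cosine similarity (dot product)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize docs
    centers = X[:K].copy()                            # simplistic init
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        labels = (X @ centers.T).argmax(axis=1)       # nearest center by cosine
        for k in range(K):
            members = X[labels == k]
            if len(members):
                c = members.sum(axis=0)
                centers[k] = c / np.linalg.norm(c)    # renormalized mean
    return labels, centers
```

With two well-separated term profiles, documents sharing a direction end up in the same cluster regardless of their lengths.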

4 Mixture Model Clustering

5 Mixture Model Clustering

6 Mixture Model Clustering. Conditional-independence model for each component (often quite useful to first order)
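Written out, the conditional-independence assumption for a component c treats the terms of a document as independent given the component, so the mixture model is:

```latex
p(\mathbf{x} \mid c) = \prod_{j=1}^{d} p(x_j \mid c),
\qquad
p(\mathbf{x}) = \sum_{c=1}^{K} p(c)\, \prod_{j=1}^{d} p(x_j \mid c).
```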

7 Mixtures of Documents
[Figure: binary term × document matrix; the documents of Component 1 and Component 2 occupy different blocks of terms.]

8 [Figure: the same term × document matrix with the component labels removed; the cluster structure is now hidden.]

9 [Figure: the term × document matrix with the cluster labels C1 and C2 treated as missing data.]

10 E-Step: estimate component membership probabilities, P(C1|x) and P(C2|x) for each document, given current parameter estimates. [Figure: the matrix from the previous slide with these probabilities attached to each document.]

11 M-Step: use the “fractionally” weighted data to compute new estimates of the parameters. [Figure: as on the previous slide.]
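The E-step/M-step loop on these slides can be written out for a mixture of independent Bernoullis over the binary term × document matrix (a minimal illustration, not the lecture's code; the smoothing constants are arbitrary):

```python
import numpy as np

def em_bernoulli_mixture(X, K=2, n_iter=50, seed=0):
    """EM for a K-component mixture of independent Bernoullis over a
    binary document-term matrix X (documents x terms), matching the
    conditional-independence component model on the slides."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                       # mixture weights
    theta = rng.uniform(0.25, 0.75, size=(K, d))   # per-component term probs
    for _ in range(n_iter):
        # E-step: responsibilities P(C_k | x) from current parameters
        log_p = (X @ np.log(theta).T
                 + (1 - X) @ np.log(1 - theta).T
                 + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)  # stabilize before exp
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from fractionally weighted data
        nk = resp.sum(axis=0)
        pi = (nk + 1e-3) / (n + K * 1e-3)          # smoothed weights
        theta = (resp.T @ X + 1e-2) / (nk[:, None] + 2e-2)  # smoothed probs
    return pi, theta, resp
```

On a matrix with two clean term blocks, like the one in the figures, the responsibilities recover the two components.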

12 A Document Cluster

Most Likely Terms in Component 5 (weight = 0.08):
TERM      p(t|k)
write     0.571
drive     0.465
problem   0.369
mail      0.364
articl    0.332
hard      0.323
work      0.319
system    0.303
good      0.296
time      0.273

Highest Lift Terms in Component 5 (weight = 0.08):
TERM      LIFT  p(t|k)  p(t)
scsi      7.7   0.13    0.02
drive     5.7   0.47    0.08
hard      4.9   0.32    0.07
card      4.2   0.23    0.06
format    4.0   0.12    0.03
softwar   3.8   0.21    0.05
memori    3.6   0.14    0.04
install   3.6   0.14    0.04
disk      3.5   0.12    0.03
engin     3.3   0.21    0.06
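The LIFT column above is just the ratio p(t|k) / p(t): how much more likely the term is inside the component than in the corpus overall. A one-line helper (the slide's values are rounded, so recomputed lifts differ slightly):

```python
def lift(p_term_given_topic, p_term):
    """Lift of a term in a topic component: p(t|k) / p(t)."""
    return p_term_given_topic / p_term
```

For example, the rounded table entries for "drive" give lift(0.47, 0.08) ≈ 5.9, close to the slide's 5.7 computed before rounding.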

13 Another Document Cluster

Most Likely Terms in Component 1 (weight = 0.11):
TERM       p(t|k)
articl     0.684
good       0.368
dai        0.363
fact       0.322
god        0.320
claim      0.294
apr        0.279
fbi        0.256
christian  0.256
group      0.239

Highest Lift Terms in Component 1 (weight = 0.11):
TERM       LIFT  p(t|k)  p(t)
fbi        8.3   0.26    0.03
jesu       5.5   0.16    0.03
fire       5.2   0.20    0.04
christian  4.9   0.26    0.05
evid       4.8   0.24    0.05
god        4.6   0.32    0.07
gun        4.2   0.17    0.04
faith      4.2   0.12    0.03
kill       3.8   0.22    0.06
bibl       3.7   0.11    0.03

14 A topic is represented as a (multinomial) distribution over words

Example topic #1: SPEECH .0691, RECOGNITION .0412, SPEAKER .0288, PHONEME .0224, CLASSIFICATION .0154, SPEAKERS .0140, FRAME .0135, PHONETIC .0119, PERFORMANCE .0111, ACOUSTIC .0099, BASED .0098, PHONEMES .0091, UTTERANCES .0091, SET .0089, LETTER .0088, …

Example topic #2: WORDS .0671, WORD .0557, USER .0230, DOCUMENTS .0205, TEXT .0195, RETRIEVAL .0152, INFORMATION .0144, DOCUMENT .0144, LARGE .0102, COLLECTION .0098, KNOWLEDGE .0087, MACHINE .0080, RELEVANT .0077, SEMANTIC .0076, SIMILARITY .0071, …

15 The basic model…. [Graphical model: class variable C with children X1, X2, …, Xd (naive Bayes structure).]

16 A better model…. [Graphical model: words X1, X2, …, Xd each depending on multiple latent components A, B, C.]

17 A better model….
History:
–latent class models in statistics
–Hofmann applied to documents (SIGIR ’99)
–recent extensions, e.g., Blei, Jordan, Ng (JMLR, 2003)
–variously known as factor/aspect/latent class models
[Graphical model: words X1, …, Xd depending on latent components A, B, C, as on the previous slide.]

18 A better model…. Inference can be intractable due to undirected loops! [Graphical model as on the previous slides.]

19 A better model for documents….
Multi-topic model
–a document is generated from multiple components
–multiple components can be active at once
–each component = a multinomial distribution over words
–parameter estimation is tricky
–very useful: “parses” a document into high-level semantic components

20 A generative model for documents

topic 1, P(w|z = 1) = φ(1): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2; SCIENTIFIC 0.0, KNOWLEDGE 0.0, WORK 0.0, RESEARCH 0.0, MATHEMATICS 0.0

topic 2, P(w|z = 2) = φ(2): HEART 0.0, LOVE 0.0, SOUL 0.0, TEARS 0.0, JOY 0.0; SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2

21 Choose mixture weights for each document, generate “bag of words”
θ = {P(z = 1), P(z = 2)}, with example documents drawn at weights {0, 1}, {0.25, 0.75}, {0.5, 0.5}, {0.75, 0.25}, and {1, 0}. The sampled word streams shift accordingly, from purely topic-2 vocabulary (MATHEMATICS KNOWLEDGE RESEARCH WORK …) through mixed documents (MATHEMATICS HEART RESEARCH LOVE …) to purely topic-1 vocabulary (… TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY).
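The sampling procedure illustrated above (for each word, pick a topic z from the document's mixture weights θ, then a word from that topic) is a few lines of code; the topic tables are the toy distributions from the previous slide:

```python
import random

# Toy topics from the slides: each topic is a multinomial over words.
TOPIC_1 = {"HEART": 0.2, "LOVE": 0.2, "SOUL": 0.2, "TEARS": 0.2, "JOY": 0.2}
TOPIC_2 = {"SCIENTIFIC": 0.2, "KNOWLEDGE": 0.2, "WORK": 0.2,
           "RESEARCH": 0.2, "MATHEMATICS": 0.2}

def generate_document(theta, n_words, rng=random):
    """Generate a bag of words: for each word, draw a topic z from
    theta = (P(z=1), P(z=2)), then draw a word from that topic."""
    doc = []
    for _ in range(n_words):
        topic = TOPIC_1 if rng.random() < theta[0] else TOPIC_2
        words, probs = zip(*topic.items())
        doc.append(rng.choices(words, weights=probs)[0])
    return doc
```

At the extreme weights {1, 0} and {0, 1} the document contains only one topic's vocabulary, as on the slide.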

22 A visual example: Bars. pixel = word, image = document; sample each pixel from a mixture of topics.

23 [Figure only; no transcribed text.]

24 [Figure only; no transcribed text.]

25 Interpretable decomposition. SVD gives a basis for the data, but not an interpretable one; the true basis is not orthogonal, so rotation does no good.

26 [Diagram: SVD factors the words × documents matrix into U (words × dims), D, and V (vectors × documents); LDA factors P(w) into P(w|z) (words × topics) and P(z) (topics × documents). (Dumais, Landauer)]
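The SVD side of this comparison is easy to sketch (a minimal latent semantic analysis via NumPy's SVD; not code from the lecture):

```python
import numpy as np

def lsa(term_doc, k):
    """Latent semantic analysis: rank-k truncated SVD of a
    term-document matrix. Returns term factors U_k (words x dims),
    singular values s_k, and document factors Vt_k (dims x docs)."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]
```

For a matrix that is genuinely low rank, the truncated factors reconstruct it exactly; the slide's point is that the resulting basis vectors, unlike topics, are orthogonal and hard to interpret.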

27 History of multi-topic models
Latent class models in statistics
Hofmann 1999
–original application to documents
Blei, Ng, and Jordan (2001, 2003)
–variational methods
Griffiths and Steyvers (2003)
–Gibbs sampling approach (very efficient)

28 A selection of topics (top words per topic):
–FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
–HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
–MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
–STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
–NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
–TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN

29 A selection of topics (top words per topic):
–PARASITE PARASITES FALCIPARUM MALARIA HOST PLASMODIUM ERYTHROCYTES ERYTHROCYTE MAJOR LEISHMANIA INFECTED BLOOD INFECTION MOSQUITO INVASION TRYPANOSOMA CRUZI BRUCEI HUMAN HOSTS
–ADULT DEVELOPMENT FETAL DAY DEVELOPMENTAL POSTNATAL EARLY DAYS NEONATAL LIFE DEVELOPING EMBRYONIC BIRTH NEWBORN MATERNAL PRESENT PERIOD ANIMALS NEUROGENESIS ADULTS
–CHROMOSOME REGION CHROMOSOMES KB MAP MAPPING CHROMOSOMAL HYBRIDIZATION ARTIFICIAL MAPPED PHYSICAL MAPS GENOMIC DNA LOCUS GENOME GENE HUMAN SITU CLONES
–MALE FEMALE MALES FEMALES SEX SEXUAL BEHAVIOR OFFSPRING REPRODUCTIVE MATING SOCIAL SPECIES REPRODUCTION FERTILITY TESTIS MATE GENETIC GERM CHOICE SRY
–STUDIES PREVIOUS SHOWN RESULTS RECENT PRESENT STUDY DEMONSTRATED INDICATE WORK SUGGEST SUGGESTED USING FINDINGS DEMONSTRATE REPORT INDICATED CONSISTENT REPORTS CONTRAST
–MECHANISM MECHANISMS UNDERSTOOD POORLY ACTION UNKNOWN REMAIN UNDERLYING MOLECULAR PS REMAINS SHOW RESPONSIBLE PROCESS SUGGEST UNCLEAR REPORT LEADING LARGELY KNOWN
–MODEL MODELS EXPERIMENTAL BASED PROPOSED DATA SIMPLE DYNAMICS PREDICTED EXPLAIN BEHAVIOR THEORETICAL ACCOUNT THEORY PREDICTS COMPUTER QUANTITATIVE PREDICTIONS CONSISTENT PARAMETERS

30 [Repeat of the previous slide’s topic lists.]

31 “Content” components vs. “boilerplate” components

Topic 1 (content): GROUP 0.057185, MULTICAST 0.051620, INTERNET 0.049499, PROTOCOL 0.041615, RELIABLE 0.020877, GROUPS 0.019552, PROTOCOLS 0.019088, IP 0.014980, TRANSPORT 0.012529, DRAFT 0.009945
Topic 2 (content): DYNAMIC 0.152141, STRUCTURE 0.137964, STRUCTURES 0.088040, STATIC 0.043452, PAPER 0.032706, DYNAMICALLY 0.023940, PRESENT 0.015328, META 0.015175, CALLED 0.011669, RECURSIVE 0.010145
Topic 3 (content): DISTRIBUTED 0.192926, COMPUTING 0.044376, SYSTEMS 0.038601, SYSTEM 0.031797, HETEROGENEOUS 0.030996, ENVIRONMENT 0.023163, PAPER 0.017960, SUPPORT 0.016587, ARCHITECTURE 0.016416, ENVIRONMENTS 0.013271
Topic 4 (boilerplate): RESEARCH 0.066798, SUPPORTED 0.043233, PART 0.035590, GRANT 0.034476, SCIENCE 0.023250, FOUNDATION 0.022653, FL 0.021220, WORK 0.021061, NATIONAL 0.019947, NSF 0.018116

32 More components

Topic 5: DIMENSIONAL 0.038901, POINTS 0.037263, SURFACE 0.031438, GEOMETRIC 0.025006, SURFACES 0.020152, MESH 0.016875, PLANE 0.013902, POINT 0.013780, GEOMETRY 0.013780, PLANAR 0.012385
Topic 6: RULES 0.090569, CLASSIFICATION 0.062699, RULE 0.062174, ACCURACY 0.028926, ATTRIBUTES 0.023090, INDUCTION 0.021909, CLASSIFIER 0.019418, SET 0.018303, ATTRIBUTE 0.016204, CLASSIFIERS 0.015417
Topic 7: ORDER 0.192759, TERMS 0.048688, PARTIAL 0.044907, HIGHER 0.041284, REDUCTION 0.035061, PAPER 0.028602, TERM 0.018204, ORDERING 0.017652, SHOW 0.017022, MAGNITUDE 0.015526
Topic 8: GRAPH 0.095687, PATH 0.061784, GRAPHS 0.061217, PATHS 0.030151, EDGE 0.028590, NUMBER 0.022775, CONNECTED 0.016817, DIRECTED 0.014405, NODES 0.013625, VERTICES 0.013554
Topic 9: INFORMATION 0.281237, TEXT 0.048675, RETRIEVAL 0.044046, SOURCES 0.029548, DOCUMENT 0.029000, DOCUMENTS 0.026503, RELEVANT 0.018523, CONTENT 0.016574, AUTOMATICALLY 0.009326, DIGITAL 0.008777
Topic 10: SYSTEM 0.143873, FILE 0.054076, OPERATING 0.053963, STORAGE 0.039072, DISK 0.029957, SYSTEMS 0.029221, KERNEL 0.028655, ACCESS 0.018293, MANAGEMENT 0.017218, UNIX 0.016878
Topic 11: PAPER 0.077870, CONDITIONS 0.041187, CONCEPT 0.036268, CONCEPTS 0.033457, DISCUSSED 0.027414, DEFINITION 0.024673, ISSUES 0.024603, PROPERTIES 0.021511, IMPORTANT 0.021370, EXAMPLES 0.019754
Topic 12: LANGUAGE 0.158786, PROGRAMMING 0.097186, LANGUAGES 0.082410, FUNCTIONAL 0.032815, SEMANTICS 0.027003, SEMANTIC 0.024341, NATURAL 0.016410, CONSTRUCTS 0.014129, GRAMMAR 0.013640, LISP 0.010326

33 “Style” components

Topic 13: MODEL 0.429185, MODELS 0.201810, MODELING 0.066311, QUALITATIVE 0.018417, COMPLEX 0.009272, QUANTITATIVE 0.005662, CAPTURE 0.005301, MODELED 0.005301, ACCURATELY 0.004639, REALISTIC 0.004278
Topic 14: PAPER 0.050411, APPROACHES 0.045245, PROPOSED 0.043132, CHANGE 0.040393, BELIEF 0.025835, ALTERNATIVE 0.022470, APPROACH 0.020905, ORIGINAL 0.019026, SHOW 0.017852, PROPOSE 0.016991
Topic 15: TYPE 0.088650, SPECIFICATION 0.051469, TYPES 0.046571, FORMAL 0.036892, VERIFICATION 0.029987, SPECIFICATIONS 0.024439, CHECKING 0.024439, SYSTEM 0.023259, PROPERTIES 0.018242, ABSTRACT 0.016826
Topic 16: KNOWLEDGE 0.212603, SYSTEM 0.090852, SYSTEMS 0.051978, BASE 0.042277, EXPERT 0.020172, ACQUISITION 0.017816, DOMAIN 0.016638, INTELLIGENT 0.015737, BASES 0.015390, BASED 0.014004

34 Recent Results on Author-Topic Models

35 Can we model authors, given documents? (More generally: build statistical profiles of entities given sparse observed data.) [Diagram: authors linked to words.]

36 Model = Author-Topic distributions + Topic-Word distributions. Parameters learned via Bayesian learning. [Diagram: authors → hidden topics → words.]

37–42 [Animation: the authors → hidden topics → words diagram repeated across six slides.]

43 “Topic Model”: a document can be generated from multiple topics; Hofmann (SIGIR ’99), Blei, Jordan, Ng (JMLR, 2003). [Diagram: hidden topics → words.]

44 Model = Author-Topic distributions + Topic-Word distributions. NOTE: documents can be composed of multiple topics. [Diagram: authors → hidden topics → words.]

45 The Author-Topic Model: Assumptions of the Generative Model
Each author is associated with a topic mixture
Each document is a mixture of topics
With multiple authors, the document is a mixture of the coauthors’ topic mixtures
Each word in a text is generated from one topic and one author (potentially different for each word)

46 Generative Process
Assume authors A1 and A2 collaborate and produce a paper
–A1 has multinomial topic distribution θ1
–A2 has multinomial topic distribution θ2
For each word in the paper:
1. Sample an author x (uniformly) from {A1, A2}
2. Sample a topic z from θx
3. Sample a word w from the multinomial topic distribution φz
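The three sampling steps above can be sketched directly (an illustration with made-up dictionary-based distributions; author_topic plays the role of θ and topic_word the role of φ):

```python
import random

def generate_paper(author_topic, topic_word, coauthors, n_words, rng=random):
    """Author-topic generative process (a sketch): for each word, pick
    a coauthor uniformly, then a topic from that author's topic
    distribution, then a word from that topic.
    author_topic: dict author -> dict topic -> prob (theta)
    topic_word:   dict topic -> dict word -> prob (phi)"""
    words = []
    for _ in range(n_words):
        x = rng.choice(coauthors)                        # 1. author
        topics, tp = zip(*author_topic[x].items())
        z = rng.choices(topics, weights=tp)[0]           # 2. topic
        vocab, wp = zip(*topic_word[z].items())
        words.append(rng.choices(vocab, weights=wp)[0])  # 3. word
    return words
```

With degenerate (one-hot) distributions the output is fully determined, which makes the chain of choices easy to trace.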

47 Graphical Model
From the set of co-authors:
1. Choose an author
2. Choose a topic
3. Choose a word
[Plate diagram of the author-topic model.]

48 Model Estimation
Estimate x and z by Gibbs sampling (assignments of each word to an author and topic)
Estimation is efficient: linear in data size
Infer:
–Author-Topic distributions (θ)
–Topic-Word distributions (φ)
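For reference, the collapsed Gibbs update used in this line of work (following Griffiths and Steyvers, and Rosen-Zvi et al.; the slides do not spell it out, so take the exact form as an assumption) samples a topic j and author k for word token i with word value m as:

```latex
P(z_i = j, x_i = k \mid w_i = m, \mathbf{z}_{-i}, \mathbf{x}_{-i})
\;\propto\;
\frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta}
\cdot
\frac{C^{AT}_{kj} + \alpha}{\sum_{j'} C^{AT}_{kj'} + T\alpha}
```

where C^{WT} and C^{AT} are word-topic and author-topic count matrices (excluding token i), V is the vocabulary size, T the number of topics, and α, β are Dirichlet smoothing parameters. The counts explain why each sweep is linear in the number of word tokens.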

49 Data
1700 proceedings papers from NIPS, 2000+ authors (NIPS = Neural Information Processing Systems)
160,000 CiteSeer abstracts (85,000+ authors)
Removed stop words; word order is irrelevant, just use word counts
Processing time:
–NIPS: 2000 Gibbs iterations ≈ 12 hours on a PC workstation
–CiteSeer: 700 Gibbs iterations ≈ 111 hours

50 Author Modeling Data Sets

Source    Documents  Unique Authors  Unique Words  Total Word Count
CiteSeer  163,389    85,465          30,799        11.7 million
CORA      13,643     11,427          11,101        1.2 million
NIPS      1,740      2,037           13,649        2.3 million

51 Four example topics from CiteSeer (T = 300) [table not transcribed]

52 Four more topics [table not transcribed]

53 Some likely topics per author (CiteSeer)
Author = Andrew McCallum, U Mass:
–Topic 1: classification, training, generalization, decision, data, …
–Topic 2: learning, machine, examples, reinforcement, inductive, …
–Topic 3: retrieval, text, document, information, content, …
Author = Hector Garcia-Molina, Stanford:
–Topic 1: query, index, data, join, processing, aggregate, …
–Topic 2: transaction, concurrency, copy, permission, distributed, …
–Topic 3: source, separation, paper, heterogeneous, merging, …
Author = Paul Cohen, USC/ISI:
–Topic 1: agent, multi, coordination, autonomous, intelligent, …
–Topic 2: planning, action, goal, world, execution, situation, …
–Topic 3: human, interaction, people, cognitive, social, natural, …

54 Four example topics from NIPS (T = 100) [table not transcribed]

55 Four more topics [table not transcribed]

56 Stability of Topics
The identity of a given topic is arbitrary across runs of the model (e.g., topic #1 in one run is not topic #1 in another)
However:
–the majority of topics are stable over the course of a run
–the majority of topics can be aligned across runs
Topics appear to represent genuine structure in the data

57 Comparing NIPS topics from the same chain (t1 = 1000 and t2 = 2000). [Figure: matrix of KL distances between topics at t1 = 1000 and re-ordered topics at t2 = 2000; best KL = 0.54, worst KL = 4.78.]
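Aligning topics across samples, as in this comparison, amounts to computing a divergence between every pair of topic-word distributions and re-ordering one set to match the other. A minimal sketch using symmetrized KL and a greedy matching (the talk does not say which matching was used; an optimal assignment would use e.g. the Hungarian algorithm):

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two topic-word distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def align_topics(A, B):
    """Greedily re-order topics in B to match topics in A by minimum
    symmetrized KL. A, B: (n_topics, vocab) arrays of topic-word
    probabilities. Returns the chosen permutation of B's rows."""
    dist = np.array([[sym_kl(a, b) for b in B] for a in A])
    perm, used = [], set()
    for i in range(len(A)):
        j = min((j for j in range(len(B)) if j not in used),
                key=lambda j: dist[i, j])
        perm.append(j)
        used.add(j)
    return perm
```

Applied to two runs of the same model, most topics find a close partner (low KL) and a few do not, matching the best/worst pattern on the slide.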

58 Comparing NIPS topics and CiteSeer topics. [Figure: matrix of KL distances between NIPS topics and re-ordered CiteSeer topics; example distances KL = 2.88, 4.48, 4.92, 5.0.]

59 Detecting Unusual Papers by Authors
For any paper by an author, we can calculate how surprising its words are under that author’s model: some papers by an author are on unusual topics.
[Tables: papers ranked by unusualness (perplexity) for C. Faloutsos and for M. Jordan; not transcribed.]
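The “unusualness” score here is perplexity: the exponentiated average negative log-probability of the document's words under the author's model. A minimal helper (it assumes the per-word probabilities have already been computed from the author-topic model):

```python
import numpy as np

def perplexity(word_probs):
    """Perplexity of a document from its per-word probabilities under
    an author's model: exp of the average negative log-likelihood.
    Higher perplexity = more surprising (unusual) for that author."""
    word_probs = np.asarray(word_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(word_probs))))
```

A document whose words all have probability 1/V scores exactly V, so perplexity is read as an effective vocabulary size.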

60 Author Separation
Test of the model: can it attribute words to authors correctly within a document?
1) Artificially combine abstracts from different authors
2) Check whether each word is assigned to its correct original author
[Figure: a merged abstract in which each word carries a superscript 1 or 2 marking its inferred author; words from the SVM/kernel-methods abstract, written by (1) Scholkopf_B, are mostly labeled 1, and words from the model-based-diagnosis abstract, written by (2) Darwiche_A, are mostly labeled 2.]

61 Applications of Author-Topic Models
“Expert Finder”
–“Find researchers who are knowledgeable in cryptography and machine learning within 100 miles of Washington DC”
–“Find reviewers for this set of NSF proposals who are active in relevant topics and have no conflicts of interest”
Prediction
–Given a document and some subset of known authors for the paper (k = 0, 1, 2, …), predict the other authors
–Predict how many papers in different topics will appear next year
Change Detection/Monitoring
–Which authors are on the leading edge of new topics?
–Characterize the “topic trajectory” of an author over time

62 [Figure only; no transcribed text.]

63 Rise in Web, Mobile, JAVA [trend chart]

64 Rise of Machine Learning [trend chart]

65 Bayes lives on…. [trend chart]

66 Decline in Languages, OS, … [trend chart]

67 Decline in CS Theory, … [trend chart]

68 Trends in Database Research [trend chart]

69 Trends in NLP and IR [trend chart: IR and NLP topic proportions]

70 Security Research Reborn… [trend chart]

71 (Not so) Hot Topics: Neural Networks, GAs, Wavelets [trend chart]

72 Decline in use of Greek Letters [trend chart]

73 Future Work
Theory development
–incorporate citation information, collaboration networks
–other document types, e.g., email: handling subject lines, email threads, and “to” and “cc” fields
New datasets:
–Enron email corpus
–Web pages
–PubMed abstracts (possibly)

74 New applications of author-topic models
Black box for text document collection summarization
–automatically extract a summary of relevant topics and author patterns for a large data set such as Enron email
“Expert Finder”
–“Find researchers who are knowledgeable in cryptography and machine learning within 100 miles of Washington DC”
–“Find reviewers for this set of NSF proposals who are active in relevant topics and have no conflicts of interest”
Change Detection/Monitoring
–Which authors are on the leading edge of new topics?
–Characterize the “topic trajectory” of an author over time
Prediction (work in progress)
–Given a document and some subset of known authors for the paper (k = 0, 1, 2, …), predict the other authors
–Predict how many papers in different topics will appear next year

75 The Author-Topic Browser
[Screenshots: (a) querying on author Pazzani_M; (b) querying on a topic relevant to that author; (c) querying on a document written by that author.]

76 Scientific syntax and semantics
[Graphical model: topic variables z generate words w; syntax class variables x form a chain over the word sequence.]
–semantics: probabilistic topics
–syntax: probabilistic regular grammar
Factorization of language based on statistical dependency patterns:
–long-range, document-specific dependencies (semantics)
–short-range dependencies constant across all documents (syntax)

77 [Figure: composite syntax/semantics model. Topic z = 1 (weight 0.4): HEART .2, LOVE .2, SOUL .2, TEARS .2, JOY .2. Topic z = 2 (weight 0.6): SCIENTIFIC .2, KNOWLEDGE .2, WORK .2, RESEARCH .2, MATHEMATICS .2. Syntax class x = 2: THE .6, A .3, MANY .1. Syntax class x = 3: OF .6, FOR .3, BETWEEN .1. Class x = 1 emits topic words; transition probabilities 0.9/0.1, 0.2/0.8, 0.7/0.3 are shown on the class-transition edges.]

78 [Same model; generated word sequence so far: THE …]

79 [Same model; generated so far: THE LOVE …]

80 [Same model; generated so far: THE LOVE OF …]

81 [Same model; generated so far: THE LOVE OF RESEARCH …]
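The generative walk on slides 77–81 (THE → LOVE → OF → RESEARCH) alternates between function-word syntax classes and topic-word emissions. A toy sketch of that process; the transition structure is supplied by the caller because the slide's full transition matrix is not recoverable from the transcript:

```python
import random

# Toy composite model from the slides: a chain over syntax classes,
# where class 1 emits "semantic" words from a per-document topic and
# classes 2 and 3 emit function words.
TOPICS = {1: {"HEART": .2, "LOVE": .2, "SOUL": .2, "TEARS": .2, "JOY": .2},
          2: {"SCIENTIFIC": .2, "KNOWLEDGE": .2, "WORK": .2,
              "RESEARCH": .2, "MATHEMATICS": .2}}
CLASS_WORDS = {2: {"THE": .6, "A": .3, "MANY": .1},
               3: {"OF": .6, "FOR": .3, "BETWEEN": .1}}

def sample(dist, rng):
    items, weights = zip(*dist.items())
    return rng.choices(items, weights=weights)[0]

def generate(theta, trans, n_words, rng):
    """theta: per-document topic weights {topic: prob};
    trans: syntax-class transition dists {class: {class: prob}}."""
    words, x = [], 2                       # start in a function-word class
    for _ in range(n_words):
        if x == 1:
            z = sample(theta, rng)         # choose topic, then topic word
            words.append(sample(TOPICS[z], rng))
        else:
            words.append(sample(CLASS_WORDS[x], rng))
        x = sample(trans[x], rng)          # move to next syntax class
    return words
```

With a deterministic 2 → 1 → 3 → 1 transition pattern the output alternates determiner, topic word, preposition, topic word, exactly the THE-LOVE-OF-RESEARCH shape on the slides.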

82 Semantic topics [table not transcribed]

83 Syntactic classes
[Table: seven syntactic classes (numbered 5, 8, 14, 25, 26, 30, 33) of function and stylistic words; the columns were fused in transcription but include, e.g., prepositions (IN, FOR, ON, BETWEEN, DURING, AMONG, FROM, …), forms of to be/remain (ARE, WERE, WAS, IS, REMAIN, REMAINS, REMAINED, …), determiners (THE, THIS, ITS, THEIR, AN, EACH, …), reporting verbs (SUGGEST, INDICATE, SHOWED, REVEALED, DEMONSTRATE, …), quantity nouns (LEVELS, NUMBER, RATE, CONCENTRATIONS, …), study nouns (RESULTS, ANALYSIS, DATA, STUDIES, FINDINGS, …), and modals (BEEN, MAY, CAN, COULD, MIGHT, SHOULD, WILL, WOULD, …).]

84 [Repeat of the previous slide’s syntactic classes.]

