Statistical Modeling of Large Text Collections Padhraic Smyth Department of Computer Science University of California, Irvine MURI Project Kick-off Meeting.

Statistical Modeling of Large Text Collections Padhraic Smyth Department of Computer Science University of California, Irvine MURI Project Kick-off Meeting November 18th 2008

3 The Text Revolution
Widespread availability of text in digital form is driving many new applications based on automated text analysis:
Categorization/classification
Automated summarization
Machine translation
Information extraction
And so on…
Most of this work is happening in computing, but many of the underlying techniques are statistical.

4 Motivation
New York Times: 1.5 million articles
MEDLINE: 16 million articles
Pennsylvania Gazette: 80,000 articles

6 Problems of Interest
What topics do these documents "span"?
Which documents are about a particular topic?
How have topics changed over time?
What does author X write about?
and so on…
Key ideas:
Learn a probabilistic model over words and documents
Treat query answering as computation of appropriate conditional probabilities

7 Topic Models for Documents
P( word | document ) = Σ_topics P( word | topic ) P( topic | document )
Topic = probability distribution over words
P( topic | document ) = mixing coefficients for each document
Automatically learned from the text corpus
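
A minimal numerical sketch of this decomposition. The two toy topics, the three-word vocabulary, and the document's mixing weights below are invented for illustration, not values from the talk:

```python
# Hypothetical example: P(word | document) as a mixture over topics.
p_word_given_topic = {
    "sports":  {"game": 0.6, "team": 0.3, "market": 0.1},
    "finance": {"game": 0.1, "team": 0.1, "market": 0.8},
}
p_topic_given_doc = {"sports": 0.25, "finance": 0.75}  # mixing coefficients for one document

def p_word_given_doc(word):
    # P(word | doc) = sum over topics of P(word | topic) * P(topic | doc)
    return sum(p_word_given_topic[t][word] * p_topic_given_doc[t]
               for t in p_topic_given_doc)

print(p_word_given_doc("market"))  # 0.1*0.25 + 0.8*0.75 = 0.625
```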

8 Topics = Multinomials over Words

10 Basic Concepts
Topics = distributions over words; unknown a priori, learned from data
Documents represented as mixtures of topics
Learning algorithm: Gibbs sampling (stochastic search), linear time per iteration
Provides a full probabilistic model over words, documents, and topics
Query answering = computation of conditional probabilities
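
As a rough illustration of the last point: once document-topic proportions have been learned, a query such as "which documents are about topic k?" reduces to sorting documents by P(topic k | document). The theta matrix below is a placeholder, not learned values:

```python
import numpy as np

# Hypothetical learned document-topic proportions (rows = documents, columns = topics).
theta = np.array([
    [0.70, 0.20, 0.10],   # doc 0
    [0.05, 0.05, 0.90],   # doc 1
    [0.30, 0.60, 0.10],   # doc 2
])

def docs_about(topic_k, top_n=2):
    # "Which documents are about topic k?" = rank documents by P(topic k | doc).
    order = np.argsort(-theta[:, topic_k])
    return order[:top_n]

print(docs_about(2))  # -> [1 0] for these made-up numbers
```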

11 Enron email data: 250,000 emails, 28,000 individuals

12 Enron business topics

13 Enron: non-work topics…

14 Enron: public-interest topics...

15 Examples of Topics from the New York Times
Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN _STOCK_INDEX
Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP

16 Topic trends from the New York Times (330,000 articles)
Tour-de-France: TOUR RIDER LANCE_ARMSTRONG TEAM BIKE RACE FRANCE
Quarterly Earnings: COMPANY QUARTER PERCENT ANALYST SHARE SALES EARNING
Anthrax: ANTHRAX LETTER MAIL WORKER OFFICE SPORES POSTAL BUILDING

20 What does an author write about?
Author = Jerry Friedman, Stanford:
Topic 1: regression, estimate, variance, data, series, …
Topic 2: classification, training, accuracy, decision, data, …
Topic 3: distance, metric, similarity, measure, nearest, …
Author = Rakesh Agrawal, IBM:
Topic 1: index, data, update, join, efficient, …
Topic 2: query, database, relational, optimization, answer, …
Topic 3: data, mining, association, discovery, attributes, …

21 Examples of Data Sets Modeled
1,200 Bible chapters (KJV)
4,000 blog entries
20,000 PNAS abstracts
80,000 Pennsylvania Gazette articles
250,000 Enron emails
300,000 North Carolina vehicle accident police reports
500,000 New York Times articles
650,000 CiteSeer abstracts
8 million MEDLINE abstracts
Books by Austen, Dickens, and Melville
…
Exactly the same algorithm was used in all cases, and in all cases interpretable topics were produced automatically.

22 Related Work
Statistical origins: latent class models in statistics (late 1960s); admixture models in genetics
LDA model: Blei, Ng, and Jordan (2003), variational EM
Topic model: Griffiths and Steyvers (2004), collapsed Gibbs sampler
Alternative approaches:
Latent semantic indexing (LSI/LSA): less interpretable, not appropriate for count data
Document clustering: simpler but less powerful

25 Clusters v. Topics
Example abstract: "Hidden Markov Models in Molecular Biology: New Algorithms and Applications," Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A. McClure. Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm, is smooth and can be applied on-line or in batch mode, with or without the usual Viterbi most-likely-path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification.
One cluster: [cluster 88] model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov
Multiple topics:
[topic 10] state hmm markov sequence models hidden states probabilities sequences parameters transition probability training hmms hybrid model likelihood modeling
[topic 37] genetic structure chain protein population region algorithms human mouse selection fitness proteins search evolution generation function sequence sequences genes

26 Extensions
Author-topic models: authors = mixtures over topics (Steyvers, Smyth, Rosen-Zvi, Griffiths, 2004)
Special-words model: documents = mixtures of topics + idiosyncratic words (Chemudugunta, Smyth, Steyvers, 2006)
Entity-topic models: topic models that can reason about entities (Newman, Chemudugunta, Smyth, Steyvers, 2006)
See also work by McCallum, Blei, Buntine, Welling, Fienberg, Xing, etc.
The probabilistic basis allows for a wide range of generalizations.

27 Combining Models for Networks and Text

31 Technical Approach and Challenges
Develop flexible probabilistic network models that can incorporate textual information:
e.g., ERGMs with text as node or edge covariates
e.g., latent space models with text-based covariates
e.g., dynamic relational models with text as edge covariates
Research challenges:
Computational scalability: ERGMs are not directly applicable to large text data sets
What text representation to use: high-dimensional "bag of words" or low-dimensional latent topics?
Utility of text: does incorporating textual information produce more accurate models or predictions, and how can this be quantified?

Graphical Model (built up over several slides): a plate diagram in which a group variable z generates each word w, for n words in each of D documents.

Mixture Model for Documents: the same diagram with group probabilities θ (a single group variable z per document) and group-word distributions φ from which the words w are drawn.

Clustering with a Mixture Model: identical structure with z read as a cluster variable, θ as cluster probabilities, and φ as cluster-word distributions.
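
A rough generative sketch of the mixture (clustering) model just described, under the reading above that each document gets a single cluster variable z. All sizes, hyperparameters, and variable names are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, D, n = 1000, 20, 5, 50          # vocab size, clusters, documents, words per doc (arbitrary)

pi  = rng.dirichlet(np.ones(K))        # cluster probabilities (theta on the slide)
phi = rng.dirichlet(np.ones(V), K)     # one word distribution per cluster, K x V

docs = []
for d in range(D):
    z = rng.choice(K, p=pi)                    # ONE cluster per document
    words = rng.choice(V, size=n, p=phi[z])    # every word drawn from that cluster's distribution
    docs.append(words)
```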

37 Graphical Model for Topics: plate diagram in which each word w in each of the D documents has its own topic assignment z (n words per document), with document-topic distributions θ and topic-word distributions φ.
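
For contrast with the mixture model above, a sketch of the generative process this plate diagram encodes, with a topic assignment per word rather than per document. Hyperparameter values and sizes are again arbitrary, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
V, T, D, n = 1000, 20, 5, 50           # vocab, topics, documents, words per doc (arbitrary)
alpha, beta = 0.1, 0.01                # symmetric Dirichlet hyperparameters (illustrative)

phi = rng.dirichlet(beta * np.ones(V), T)        # topic-word distributions, T x V

docs = []
for d in range(D):
    theta_d = rng.dirichlet(alpha * np.ones(T))  # document-topic distribution
    z = rng.choice(T, size=n, p=theta_d)         # a topic assignment for EACH word
    words = np.array([rng.choice(V, p=phi[t]) for t in z])
    docs.append(words)
```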

38 Learning via Gibbs sampling: the same graphical model, with a Gibbs sampler used to estimate the topic assignment z for each word occurrence, marginalizing over the other parameters (θ, φ).
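
A compact sketch of the standard collapsed Gibbs update for a single word token, with θ and φ integrated out as the slide indicates. The count arrays and names are my own notation, not the authors' code:

```python
import numpy as np

def resample_token(i, d, w, z, n_wt, n_dt, n_t, alpha, beta, rng):
    """Resample the topic of token i (word id w) in document d.

    n_wt[v, t]: count of word v assigned to topic t
    n_dt[d, t]: count of tokens in doc d assigned to topic t
    n_t[t]:     total tokens assigned to topic t
    """
    V = n_wt.shape[0]
    t_old = z[i]
    # Remove the token's current assignment from the counts.
    n_wt[w, t_old] -= 1
    n_dt[d, t_old] -= 1
    n_t[t_old]     -= 1
    # P(z_i = t | rest) is proportional to
    #   (n_wt[w, t] + beta) / (n_t[t] + V*beta) * (n_dt[d, t] + alpha)
    p = (n_wt[w] + beta) / (n_t + V * beta) * (n_dt[d] + alpha)
    t_new = rng.choice(len(p), p=p / p.sum())
    # Add the token back under its new assignment.
    n_wt[w, t_new] += 1
    n_dt[d, t_new] += 1
    n_t[t_new]     += 1
    z[i] = t_new
```

One pass of this update over every token in every document is one Gibbs iteration, which is why the cost per iteration is linear in the number of word tokens.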

39 More Details on Learning
Gibbs sampling for word-topic assignments (z): 1 iteration = a full pass through all words in all documents; typically run a few hundred Gibbs iterations
Estimating θ and φ: use z samples to get point estimates; non-informative Dirichlet priors for θ and φ
Computational efficiency: learning is linear in the number of word tokens, but can still take on the order of a day for 100k or more documents
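
Given the count matrices accumulated by the sampler, the point estimates mentioned on the slide are simple smoothed ratios of counts. This is the generic formulation; it is not necessarily the exact estimator used in the talk:

```python
def point_estimates(n_wt, n_dt, alpha, beta):
    # n_wt, n_dt: numpy count arrays of shape (V, T) and (D, T), from one z sample.
    # phi[t, v]   = (n_wt[v, t] + beta)  / (n_t[t] + V*beta)    topic-word distributions
    # theta[d, t] = (n_dt[d, t] + alpha) / (n_d[d] + T*alpha)   document-topic distributions
    phi = n_wt.T + beta
    phi /= phi.sum(axis=1, keepdims=True)
    theta = n_dt + alpha
    theta /= theta.sum(axis=1, keepdims=True)
    return phi, theta
```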

40 Gibbs Sampler Stability