1
Statistical Modeling of Large Text Collections
Padhraic Smyth, Department of Computer Science, University of California, Irvine
MURI Project Kick-off Meeting, November 18, 2008
3
3 The Text Revolution
Widespread availability of text in digital form is driving many new applications based on automated text analysis:
- Categorization/classification
- Automated summarization
- Machine translation
- Information extraction
- And so on…
Most of this work is happening in computing, but many of the underlying techniques are statistical.
4
4 Motivation
- New York Times: 1.5 million articles
- Medline: 16 million articles
- Pennsylvania Gazette: 80,000 articles (1728-1800)
6
6 Problems of Interest
- What topics do these documents "span"?
- Which documents are about a particular topic?
- How have topics changed over time?
- What does author X write about?
- and so on…
Key ideas:
- Learn a probabilistic model over words and documents
- Treat query-answering as computation of appropriate conditional probabilities
7
7 Topic Models for Documents
P(word | document) = Σ_topics P(word | topic) × P(topic | document)
- Topic = probability distribution over words
- P(topic | document) = mixing coefficients for each document
- Both are learned automatically from the text corpus
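A minimal numerical sketch of this decomposition, with made-up topics and vocabulary (none of these numbers come from the talk):

```python
import numpy as np

# Toy illustration of the decomposition on this slide (made-up numbers):
# P(word | document) = sum over topics of P(word | topic) * P(topic | document)

vocab = ["gene", "protein", "stock", "market"]

# P(word | topic): each row is a topic, i.e. a multinomial over the vocabulary
phi = np.array([
    [0.45, 0.45, 0.05, 0.05],   # an invented "biology" topic
    [0.05, 0.05, 0.45, 0.45],   # an invented "finance" topic
])

# P(topic | document): mixing coefficients for one document
theta = np.array([0.8, 0.2])

# P(word | document) for every word in the vocabulary
p_word_given_doc = theta @ phi
print(dict(zip(vocab, p_word_given_doc.round(3))))
# {'gene': 0.37, 'protein': 0.37, 'stock': 0.13, 'market': 0.13}
```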
8
8 Topics = Multinomials over Words
10
10 Basic Concepts
- Topics = distributions over words, unknown a priori and learned from data
- Documents represented as mixtures of topics
- Learning algorithm: Gibbs sampling (stochastic search), linear time per iteration
- Provides a full probabilistic model over words, documents, and topics
- Query answering = computation of conditional probabilities (see the sketch below)
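To make the last bullet concrete, a hedged sketch assuming the learned model is available as a document-topic matrix and a topic-word matrix (the array names and random values below are illustrative, not from the talk):

```python
import numpy as np

# Hypothetical learned quantities (names are illustrative):
# theta[d, k] = P(topic k | document d), phi[k, w] = P(word w | topic k)
rng = np.random.default_rng(0)
theta = rng.dirichlet(alpha=np.ones(5), size=100)   # 100 documents, 5 topics
phi = rng.dirichlet(alpha=np.ones(1000), size=5)    # 5 topics, 1000-word vocabulary

# "Which documents are about topic k?" -> rank documents by P(topic k | doc)
def docs_about_topic(theta, k, top_n=10):
    return np.argsort(-theta[:, k])[:top_n]

# "What is this document about?" -> its topics, most probable first
def document_topics(theta, d):
    return np.argsort(-theta[d])

print(docs_about_topic(theta, k=2))
print(document_topics(theta, d=0))
```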
11
11 Enron email data: 250,000 emails, 28,000 individuals, 1999-2002
12
12 Enron email: business topics
13
13 Enron: non-work topics…
14
14 Enron: public-interest topics...
15
15 Examples of Topics from New York Times
- Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX
- Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
- Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
- Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP
16
16 Topic trends from New York Times (330,000 articles, 2000-2002)
- Tour-de-France: TOUR RIDER LANCE_ARMSTRONG TEAM BIKE RACE FRANCE
- Quarterly Earnings: COMPANY QUARTER PERCENT ANALYST SHARE SALES EARNING
- Anthrax: ANTHRAX LETTER MAIL WORKER OFFICE SPORES POSTAL BUILDING
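One common way to produce trend curves like these, shown here as a hedged sketch rather than the exact procedure behind the slide, is to average each document's topic proportions within a time bucket (all data below is synthetic):

```python
import numpy as np
from collections import defaultdict

# theta[d, k] = P(topic k | document d); dates[d] = (year, quarter) for document d.
# All values here are synthetic placeholders.
def topic_trend(theta, dates, topic):
    totals, counts = defaultdict(float), defaultdict(int)
    for d, key in enumerate(dates):
        totals[key] += theta[d, topic]
        counts[key] += 1
    # average topic proportion per time bucket, in chronological order
    return {key: totals[key] / counts[key] for key in sorted(totals)}

theta = np.random.default_rng(1).dirichlet(np.ones(50), size=1000)   # 1000 docs, 50 topics
dates = [(2000 + d % 3, 1 + d % 4) for d in range(1000)]             # (year, quarter) labels
print(topic_trend(theta, dates, topic=7))
```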
20
20 What does an author write about?
Author = Jerry Friedman, Stanford:
- Topic 1: regression, estimate, variance, data, series, …
- Topic 2: classification, training, accuracy, decision, data, …
- Topic 3: distance, metric, similarity, measure, nearest, …
Author = Rakesh Agrawal, IBM:
- Topic 1: index, data, update, join, efficient, …
- Topic 2: query, database, relational, optimization, answer, …
- Topic 3: data, mining, association, discovery, attributes, …
21
21 Examples of Data Sets Modeled
- 1,200 Bible chapters (KJV)
- 4,000 blog entries
- 20,000 PNAS abstracts
- 80,000 Pennsylvania Gazette articles
- 250,000 Enron emails
- 300,000 North Carolina vehicle accident police reports
- 500,000 New York Times articles
- 650,000 CiteSeer abstracts
- 8 million MEDLINE abstracts
- Books by Austen, Dickens, and Melville
- …
Exactly the same algorithm was used in all cases, and in all cases interpretable topics were produced automatically.
22
22 Related Work
- Statistical origins: latent class models in statistics (late 1960s); admixture models in genetics
- LDA model: Blei, Ng, and Jordan (2003), variational EM
- Topic model: Griffiths and Steyvers (2004), collapsed Gibbs sampler
- Alternative approaches: latent semantic indexing (LSI/LSA), which is less interpretable and not appropriate for count data; document clustering, which is simpler but less powerful
25
25 Clusters v. Topics
Example abstract: "Hidden Markov Models in Molecular Biology: New Algorithms and Applications," Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A. McClure.
Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm, is smooth and can be applied on-line or in batch mode, with or without the usual Viterbi most likely path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification.
One cluster [cluster 88]: model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov
Multiple topics:
- [topic 10]: state hmm markov sequence models hidden states probabilities sequences parameters transition probability training hmms hybrid model likelihood modeling
- [topic 37]: genetic structure chain protein population region algorithms human mouse selection fitness proteins search evolution generation function sequence sequences genes
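A toy, hedged illustration of the contrast, not the actual CiteSeer model: a cluster model commits the whole abstract to a single cluster, whereas a topic model represents the same document as a mixture over topics (all numbers are invented):

```python
import numpy as np

# Toy word counts for the abstract above over a tiny invented vocabulary
vocab = ["hmm", "protein", "training", "alignment"]
doc_counts = np.array([6, 5, 4, 3])

# Mixture-of-multinomials clustering: the document gets exactly one cluster
cluster_word = np.array([[0.5, 0.1, 0.3, 0.1],   # invented "machine learning" cluster
                         [0.1, 0.5, 0.1, 0.3]])  # invented "molecular biology" cluster
cluster_loglik = doc_counts @ np.log(cluster_word).T
print("single best cluster:", int(cluster_loglik.argmax()))

# Topic model: the same document is described by a distribution over topics,
# so it can be partly "machine learning" and partly "molecular biology" at once
theta_doc = np.array([0.55, 0.45])   # illustrative mixture, not a fitted value
print("topic mixture:", theta_doc)
```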
26
26 Extensions
- Author-topic models: authors = mixtures over topics (Steyvers, Smyth, Rosen-Zvi, Griffiths, 2004); see the sketch after this list
- Special-words model: documents = mixtures of topics + idiosyncratic words (Chemudugunta, Smyth, Steyvers, 2006)
- Entity-topic models: topic models that can reason about entities (Newman, Chemudugunta, Smyth, Steyvers, 2006)
- See also work by McCallum, Blei, Buntine, Welling, Fienberg, Xing, etc.
The probabilistic basis allows for a wide range of generalizations.
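For the author-topic extension, a generative sketch in the spirit of the cited model: each word's author is chosen from the document's author list, a topic is drawn from that author's topic distribution, and the word is drawn from that topic. Parameter values below are placeholders:

```python
import numpy as np

def generate_document(authors, author_topic, topic_word, n_words, rng):
    """authors: list of author ids for the document
       author_topic[a, k] = P(topic k | author a)
       topic_word[k, w]   = P(word w | topic k)"""
    words = []
    for _ in range(n_words):
        a = rng.choice(authors)                                    # pick an author uniformly
        z = rng.choice(author_topic.shape[1], p=author_topic[a])   # topic from that author
        w = rng.choice(topic_word.shape[1], p=topic_word[z])       # word from that topic
        words.append(int(w))
    return words

rng = np.random.default_rng(0)
author_topic = rng.dirichlet(np.ones(10), size=3)    # 3 authors, 10 topics (placeholder values)
topic_word = rng.dirichlet(np.ones(500), size=10)    # 10 topics, 500-word vocabulary
print(generate_document([0, 2], author_topic, topic_word, n_words=20, rng=rng))
```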
27
27 Combining Models for Networks and Text
31
31 Technical Approach and Challenges
Develop flexible probabilistic network models that can incorporate textual information, e.g.:
- ERGMs with text as node or edge covariates
- Latent space models with text-based covariates (see the sketch below)
- Dynamic relational models with text as edge covariates
Research challenges:
- Computational scalability: ERGMs not directly applicable to large text data sets
- What text representation to use: high-dimensional "bag of words"? Low-dimensional latent topics?
- Utility of text: does the incorporation of textual information produce more accurate models or predictions? How can this be quantified?
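As a concrete, hedged illustration of the latent space option above, one standard parameterization adds a text-based covariate (for example, topic similarity between two nodes) to the edge logit. This is only a sketch of the model family, not the project's actual model:

```python
import numpy as np

def edge_probability(z_i, z_j, theta_i, theta_j, alpha, beta):
    """z_*: latent positions of the two nodes; theta_*: topic proportions of each node's text."""
    distance = np.linalg.norm(z_i - z_j)     # latent-space distance (closer nodes link more often)
    text_sim = float(theta_i @ theta_j)      # simple text-similarity covariate
    logit = alpha - distance + beta * text_sim
    return 1.0 / (1.0 + np.exp(-logit))      # logistic link

rng = np.random.default_rng(0)
z = rng.normal(size=(2, 2))                  # illustrative 2-D latent positions
theta = rng.dirichlet(np.ones(20), size=2)   # illustrative topic proportions for two nodes
print(edge_probability(z[0], z[1], theta[0], theta[1], alpha=0.5, beta=2.0))
```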
32
Graphical Model [diagram: a group variable z generates Word 1, Word 2, …, Word n]
33
Graphical Model [plate diagram: group variable z generates word w; plate over n words]
34
Graphical Model [plate diagram: group variable z generates word w; plates over n words and D documents]
35
Mixture Model for Documents [plate diagram: group probabilities generate the group variable z, which together with the group-word distributions generates word w; plates over n words and D documents]
36
Clustering with a Mixture Model [same plate diagram with z as a cluster variable, cluster probabilities, and cluster-word distributions]
37
37 Graphical Model for Topics [plate diagram: document-topic distributions generate a topic z for each word; topic-word distributions generate the word w; plates over n words and D documents]
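Read generatively, the diagram says: for each word in a document, draw a topic from the document's topic distribution, then draw the word from that topic's word distribution. A small synthetic sketch (dimensions are arbitrary):

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, rng):
    topic_word = rng.dirichlet(np.ones(vocab_size), size=n_topics)   # P(w | z)
    doc_topic = rng.dirichlet(np.ones(n_topics), size=n_docs)        # P(z | d)
    corpus = []
    for d in range(n_docs):
        doc = []
        for _ in range(doc_len):
            z = rng.choice(n_topics, p=doc_topic[d])    # topic for this word position
            w = rng.choice(vocab_size, p=topic_word[z]) # word drawn from that topic
            doc.append(int(w))
        corpus.append(doc)
    return corpus, doc_topic, topic_word

corpus, doc_topic, topic_word = generate_corpus(5, 50, 4, 200, np.random.default_rng(0))
print(corpus[0][:10])
```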
38
38 Learning via Gibbs sampling [same plate diagram] Gibbs sampler estimates z for each word occurrence, marginalizing over the other parameters.
39
39 More Details on Learning
- Gibbs sampling for word-topic assignments (z): one iteration = a full pass through all words in all documents; typically run a few hundred Gibbs iterations (see the sketch below)
- Estimating θ and the topic-word distributions: use the z samples to get point estimates; non-informative Dirichlet priors on both
- Computational efficiency: learning is linear in the number of word tokens, but can still take on the order of a day on 100k or more documents
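A compact collapsed Gibbs sampler consistent with these bullets (one iteration is a full pass over every word token, so the cost is linear in the number of tokens). The hyperparameters alpha and beta stand in for the non-informative Dirichlet priors; this is a sketch, not the implementation behind the results in this talk:

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))        # doc-topic counts
    nkw = np.zeros((n_topics, vocab_size))       # topic-word counts
    nk = np.zeros(n_topics)                      # topic totals
    z = [[0] * len(doc) for doc in docs]

    # random initialization of topic assignments
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = rng.integers(n_topics)
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):           # one iteration = full pass over all tokens
            for i, w in enumerate(doc):
                k = z[d][i]                      # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # conditional P(z = k | everything else), other parameters marginalized out
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k                      # resample and restore the counts
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    # point estimates of the document-topic and topic-word distributions
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi

# example usage on a tiny synthetic corpus of word-id lists:
# theta, phi = gibbs_lda(docs=[[0, 3, 3, 1], [2, 2, 4]], n_topics=2, vocab_size=5)
```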
40
40 Gibbs Sampler Stability
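The slide itself is a figure. One simple stability check (an assumption here, not necessarily the analysis shown) is to run two Gibbs chains and greedily match their topic-word distributions by cosine similarity; consistently high similarities suggest the recovered topics are stable across runs:

```python
import numpy as np

def match_topics(phi_a, phi_b):
    """Greedily pair topics from two runs by cosine similarity of their word distributions."""
    sim = phi_a @ phi_b.T / (
        np.linalg.norm(phi_a, axis=1)[:, None] * np.linalg.norm(phi_b, axis=1)[None, :])
    matches, used = [], set()
    for k in np.argsort(-sim.max(axis=1)):       # handle best-matched topics of run A first
        j = max((j for j in range(phi_b.shape[0]) if j not in used), key=lambda j: sim[k, j])
        used.add(j)
        matches.append((int(k), int(j), float(sim[k, j])))
    return matches   # high similarities across the board suggest stable topics

# example usage, given topic-word matrices from two independent chains:
# matches = match_topics(phi_run1, phi_run2)
```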