
1 Author-Topic Models for Large Text Corpora Padhraic Smyth Department of Computer Science University of California, Irvine In collaboration with: Mark Steyvers (UCI) Michal Rosen-Zvi (UCI) Tom Griffiths (Stanford)

2 Outline Problem motivation: modeling large sets of documents. Probabilistic approaches: topic models -> author-topic models. Results: author-topic results from CiteSeer, NIPS, and Enron data. Applications of the model (demo of author-topic query tool). Future directions.

3 Data Sets of Interest Data = a set of documents: a large collection of documents (10k, 100k, etc.); authors of the documents are known; years/dates of the documents are known; … (we will typically assume a bag-of-words representation).

4 Examples of Data Sets CiteSeer: 160k abstracts, 80k authors, 1986-2002. NIPS papers: 2k papers, 1k authors, 1987-1999. Reuters: 20k newspaper articles, 114 authors.

5 Pennsylvania Gazette, 1728-1800: 80,000 articles, 25 million words (www.accessible.com).

6 Enron email data: 500,000 emails, 5,000 authors, 1999-2002.

7

8 Problems of Interest What topics do these documents "span"? Which documents are about a particular topic? How have topics changed over time? What does author X write about? Who is likely to write about topic Y? Who wrote this specific document? And so on…

9 A topic is represented as a (multinomial) distribution over words
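To make this concrete, here is a minimal Python sketch of a topic as a multinomial over a tiny vocabulary; the words and probabilities are illustrative (they echo the toy cluster example on the following slides), not estimates from a real corpus.

```python
import numpy as np

# A "topic" is just a probability distribution over the vocabulary.
vocab = ["learning", "bayesian", "probabilistic", "retrieval", "information"]
topic = np.array([0.50, 0.25, 0.25, 0.00, 0.00])  # probabilities sum to 1

# Generating text from a topic = repeated multinomial draws.
rng = np.random.default_rng(0)
tokens = rng.choice(vocab, size=4, p=topic)
print(tokens)  # e.g. ['learning' 'learning' 'bayesian' 'probabilistic']
```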

10 Cluster Models DOCUMENT 1 Learning Learning Bayesian Probabilistic DOCUMENT 2 Retrieval Information Retrieval Information

11 Cluster Models DOCUMENT 1 Learning Learning Bayesian Probabilistic DOCUMENT 2 Retrieval Information Retrieval Information P(probabilistic | topic) = 0.25 P(learning | topic) = 0.50 P(Bayesian | topic) = 0.25 P(other words | topic) = 0.00 P(information | topic) = 0.5 P(retrieval | topic) = 0.5 P(other words | topic) = 0.0

12 Graphical Model z w Cluster Variable Word n words

13 Graphical Model z w Cluster Variable Word D documents n words

14 Graphical Model z w Cluster Variable Word Cluster-Word distributions D documents n words Cluster Weights

15 Cluster Models DOCUMENT 1 Learning Learning Bayesian Probabilistic DOCUMENT 2 Retrieval Information Retrieval Information DOCUMENT 3 Learning Information Retrieval Probabilistic

16 Topic Models DOCUMENT 1 Learning Learning Bayesian Probabilistic DOCUMENT 2 Retrieval Information Retrieval Information

17 Topic Models DOCUMENT 1 Learning Learning Bayesian Probabilistic DOCUMENT 2 Retrieval Information Retrieval Information DOCUMENT 3 Learning Information Retrieval Probabilistic

18 History of topic models Latent class models in statistics (late 60's). Hofmann (1999): original application to documents. Blei, Ng, and Jordan (2001, 2003): variational methods. Griffiths and Steyvers (2003, 2004): Gibbs sampling approach (very efficient).

19 Word/Document counts for 16 Artificial Documents Can we recover the original topics and topic mixtures from this data?

20 Example of Gibbs Sampling Assign word tokens randomly to topics (● = topic 1; ● = topic 2).

21 After 1 iteration Apply the sampling equation to each word token.

22 After 4 iterations

23 After 32 iterations

24 Topic Models DOCUMENT 1 Learning Learning Bayesian Probabilistic DOCUMENT 2 Retrieval Information Retrieval Information DOCUMENT 3 Learning Information Retrieval Probabilistic

25 Author-Topic Models DOCUMENT 1 Learning Learning Bayesian Probabilistic DOCUMENT 2 Retrieval Information Retrieval Information

26 Author-Topic Models DOCUMENT 1 Learning Learning Bayesian Probabilistic DOCUMENT 2 Retrieval Information Retrieval Information DOCUMENT 3 Learning Information Retrieval Probabilistic

27 Approach The author-topic model: a probabilistic model linking authors and topics (authors -> topics -> words); learned from data; completely unsupervised, no labels; a generative model. Different questions or queries can be answered by appropriate probability calculus, e.g., p(author | words in document), p(topic | author).
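As a rough sketch of what "appropriate probability calculus" looks like once the model is fitted: queries such as p(topic | author) are direct lookups in the author-topic matrix, and p(author | words in document) follows from Bayes' rule. The matrix names, toy sizes, and uniform author prior below are assumptions for illustration, not part of the model specification.

```python
import numpy as np

# theta[a, t] = p(topic t | author a); phi[t, w] = p(word w | topic t).
# Toy sizes; in the experiments reported here the numbers of authors and
# topics are in the thousands/hundreds and the vocabulary is much larger.
rng = np.random.default_rng(0)
A, T, V = 5, 4, 10
theta = rng.dirichlet(np.ones(T), size=A)   # p(topic | author)
phi = rng.dirichlet(np.ones(V), size=T)     # p(word | topic)

def p_topic_given_author(a):
    return theta[a]                          # a direct table lookup

def p_author_given_words(word_ids, prior=None):
    # p(a | w_1..w_n) proportional to p(a) * prod_i sum_t p(w_i | t) p(t | a)
    prior = np.full(A, 1.0 / A) if prior is None else prior
    log_post = np.log(prior)
    for w in word_ids:
        log_post += np.log(theta @ phi[:, w])   # marginalize over topics
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

print(p_author_given_words([0, 3, 7]))
```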

28 Graphical Model x z Author Topic

29 x z w Author Topic Word

30 x z w Author Topic Word n

31 x z w a Author Topic Word D n

32 x z w a Author Topic Word θ Author-Topic distributions φ Topic-Word distributions D n

33 Generative Process Let's assume authors A1 and A2 collaborate and produce a paper. A1 has multinomial topic distribution θ1; A2 has multinomial topic distribution θ2. For each word in the paper: 1. Sample an author x (uniformly) from {A1, A2}; 2. Sample a topic z from θx; 3. Sample a word w from the multinomial topic distribution φz.
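A sketch of that three-step generative process in Python, following the slide; the toy sizes and the randomly drawn θ and φ tables are placeholders, not learned values.

```python
import numpy as np

rng = np.random.default_rng(1)
T, V = 4, 10                                  # number of topics, vocabulary size (toy)
vocab = [f"w{i}" for i in range(V)]
phi = rng.dirichlet(np.ones(V), size=T)       # phi[z]: topic-word distributions
theta = {"A1": rng.dirichlet(np.ones(T)),     # A1's topic distribution (theta_1)
         "A2": rng.dirichlet(np.ones(T))}     # A2's topic distribution (theta_2)

def generate_paper(authors, n_words):
    words = []
    for _ in range(n_words):
        x = rng.choice(authors)               # 1. sample an author uniformly
        z = rng.choice(T, p=theta[x])         # 2. sample a topic from theta_x
        w = rng.choice(V, p=phi[z])           # 3. sample a word from phi_z
        words.append(vocab[w])
    return words

print(generate_paper(["A1", "A2"], 8))
```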

34 Graphical Model x z w a Author Topic Word θ Author-Topic distributions φ Topic-Word distributions D n

35 Learning Observed: W = observed words, A = sets of known authors. Unknown: x, z (hidden variables); θ, φ (unknown parameters). Interested in: p(x, z | W, A) and p(θ, φ | W, A). But exact inference is not tractable.

36 Step 1: Gibbs sampling of x and z x z w a Author Topic Word θ φ D n Marginalize over unknown parameters

37 Step 2: MAP estimates of θ and φ x z w a Author Topic Word θ φ D n Condition on particular samples of x and z

38 Step 2: MAP estimates of θ and φ x z w a Author Topic Word θ φ D n Point estimates of unknown parameters

39 More Details on Learning Gibbs sampling for x and z: typically run 2000 Gibbs iterations (1 iteration = a full pass through all documents). Estimating θ and φ: x and z samples -> point estimates; non-informative Dirichlet priors for θ and φ. Computational efficiency: learning is linear in the number of word tokens. Predictions on new documents: can average over θ and φ (from different samples, different runs).
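The point estimates mentioned here are smoothed count ratios computed from a single Gibbs sample. A minimal sketch, assuming symmetric Dirichlet priors α and β; the count-matrix names and default hyperparameter values are mine, chosen only for illustration.

```python
import numpy as np

def point_estimates(C_AT, C_WT, alpha=0.5, beta=0.01):
    """Point estimates of theta and phi from one Gibbs sample.

    C_AT[a, t] = number of word tokens assigned to topic t for author a
    C_WT[w, t] = number of times word w is assigned to topic t
    alpha, beta = symmetric Dirichlet hyperparameters (illustrative defaults)
    """
    T = C_AT.shape[1]
    V = C_WT.shape[0]
    theta = (C_AT + alpha) / (C_AT.sum(axis=1, keepdims=True) + T * alpha)  # p(t | a)
    phi = (C_WT + beta) / (C_WT.sum(axis=0, keepdims=True) + V * beta)      # p(w | t)
    return theta, phi.T   # theta[a, t], phi[t, w]
```

Averaging such estimates over several samples (and several runs) gives the smoother predictive distributions used for new documents.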

40 Gibbs Sampling Need full conditional distributions for the hidden variables. The probability of assigning the current word i to topic j and author k, given everything else, is proportional to (the number of times word w is assigned to topic j) times (the number of times topic j is assigned to author k), each smoothed by its Dirichlet prior and normalized.
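A sketch of one collapsed Gibbs update built from exactly those two count ratios, in the spirit of the sampler described in Rosen-Zvi et al. (UAI 2004, cited on the final slide). It assumes the count matrices already exclude the token being resampled; the function and variable names are mine.

```python
import numpy as np

def sample_topic_author(w, authors_d, C_WT, C_AT, alpha, beta, rng):
    """Resample (topic z, author x) for one word token w in a document
    whose author set is authors_d (a list of author indices).

    C_WT[w, t]: times word w is assigned to topic t (current token excluded)
    C_AT[a, t]: times topic t is assigned to author a (current token excluded)
    """
    V, T = C_WT.shape
    # "number of times word w assigned to topic j", smoothed and normalized
    p_w_given_t = (C_WT[w] + beta) / (C_WT.sum(axis=0) + V * beta)            # shape (T,)
    # "number of times topic j assigned to author k", smoothed and normalized
    p_t_given_a = (C_AT[authors_d] + alpha) / \
                  (C_AT[authors_d].sum(axis=1, keepdims=True) + T * alpha)    # (|A_d|, T)
    probs = (p_t_given_a * p_w_given_t).ravel()      # joint over (author, topic)
    probs /= probs.sum()
    idx = rng.choice(len(probs), p=probs)
    x = authors_d[idx // T]                          # chosen author
    z = idx % T                                      # chosen topic
    return z, x
```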

41 Experiments on Real Data Corpora: CiteSeer (160K abstracts, 85K authors); NIPS (1.7K papers, 2K authors); Enron (115K emails, 5K authors (senders)); PubMed (27K abstracts, 50K authors). Removed stop words; no stemming. Ignore word order, just use word counts. Processing time: NIPS, 2000 Gibbs iterations ≈ 8 hours; CiteSeer, 2000 Gibbs iterations ≈ 4 days.

42 Four example topics from CiteSeer (T=300)

43 More CiteSeer Topics

44 Some topics relate to generic word usage

45 What can the Model be used for? We can analyze our document set through the "topic lens". Applications: queries (Who writes on this topic? e.g., finding experts or reviewers in a particular area; What topics does this person do research on?); discovering trends over time; detecting unusual papers and authors; interactive browsing of a digital library via topics; parsing documents (and parts of documents) by topic; and more…

46 Some likely topics per author (CiteSeer) Author = Andrew McCallum, U Mass: Topic 1: classification, training, generalization, decision, data, …; Topic 2: learning, machine, examples, reinforcement, inductive, …; Topic 3: retrieval, text, document, information, content, … Author = Hector Garcia-Molina, Stanford: Topic 1: query, index, data, join, processing, aggregate, …; Topic 2: transaction, concurrency, copy, permission, distributed, …; Topic 3: source, separation, paper, heterogeneous, merging, … Author = Paul Cohen, USC/ISI: Topic 1: agent, multi, coordination, autonomous, intelligent, …; Topic 2: planning, action, goal, world, execution, situation, …; Topic 3: human, interaction, people, cognitive, social, natural, …

47 Temporal patterns in topics: hot and cold topics We have CiteSeer papers from 1986-2002. For each year, calculate the fraction of words assigned to each topic -> a time series per topic. Hot topics become more prevalent; cold topics become less prevalent.
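The time series comes directly from a Gibbs sample: count the word tokens assigned to each topic in each year and normalize. A minimal sketch; the input format (one (year, topic) pair per token) is an assumption for illustration.

```python
from collections import defaultdict

def topic_time_series(token_assignments, num_topics):
    """token_assignments: iterable of (year, topic) pairs, one per word token,
    taken from a Gibbs sample.  Returns {year: [fraction of words per topic]}."""
    counts = defaultdict(lambda: [0] * num_topics)
    for year, topic in token_assignments:
        counts[year][topic] += 1
    return {year: [c / sum(row) for c in row]
            for year, row in sorted(counts.items())}
```

A topic whose fraction rises over the years is "hot"; one whose fraction falls is "cold".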

48

49

50

51

52

53

54

55 Four example topics from NIPS (T=100)

56 NIPS: support vector topic

57 NIPS: neural network topic

58 Pennsylvania Gazette Data (courtesy of David Newman, UC Irvine)

59 Enron email data: 500,000 emails, 5,000 authors, 1999-2002.

60 Enron email topics

61 Non-work Topics…

62 Topical Topics

63 Enron email: California Energy Crisis Message-ID: Date: Fri, 27 Apr 2001 09:25:00 -0700 (PDT) Subject: California Update 4/27/01 … FERC price cap decision reflects Bush political and economic objectives. Politically, Bush is determined to let the crisis blame fall on Davis; from an economic perspective, he is unwilling to create disincentives for new power generation. The FERC decision is a holding move by the Bush administration that looks like action, but is not. Rather, it allows the situation in California to continue to develop virtually unabated. The political strategy appears to allow the situation to deteriorate to the point where Davis cannot escape shouldering the blame. Once they are politically inoculated, the Administration can begin to look at regional solutions. Moreover, the Administration has already made explicit (and will certainly restate in the forthcoming Cheney commission report) its opposition to stronger price caps …

64 Enron email: US Senate Bill Message-ID: Date: Thu, 15 Jun 2000 08:59:00 -0700 (PDT) From: *************** To: *************** Subject: Senate Commerce Committee Pipeline Safety Markup The Senate Commerce Committee held a markup today where Senator John McCain's (R-AZ) pipeline safety legislation, S. 2438, was approved. The overall outcome was not unexpected -- the final legislation contained several provisions that went a little bit further than Enron and INGAA would have liked, … 2) McCain amendment to Section 13 (b) (on operator assistance investigations) -- Approved by voice vote. … 3) Sen. John Kerry (D-MA) Amendment on Enforcement -- Approved by voice vote. Another confusing vote, in which many members did not understand the changes being made, but agreed to it on the condition that clarifications be made before Senate floor action. Late last night, Enron led a group including companies from INGAA and AGA in providing comments to Senator Kerry which caused him to make substantial changes to his amendment before it was voted on at markup, including dropping provisions allowing citizen suits and other troubling issues. In the end, the amendment that passed was acceptable to industry.

65 Enron email: political donations 10/16/2000 04:41 PM Subject: Ashcroft Senate Campaign Request We have received a request from the Ashcroft Senate campaign for $10,000 in soft money. This is the race where Governor Carnahan is the challenger. Enron PAC has contributed $10,000 and Enron has also contributed $15,000 soft money in this campaign to Senator Ashcroft. Ken Lay has been personally interested in the Ashcroft campaign. Our polling information is that Ashcroft is currently leading 43 to 38 with an undecided of 19 percent. … Message-ID: Date: Mon, 16 Oct 2000 14:13:00 -0700 (PDT) From: ***** To: ***** Subject: Re: Ashcroft Senate Campaign Request If you can cover it I would say yes. It's a key race and we have been close to Ashcroft for years. Let's make sure he knows we gave it.... we need to follow up with him. Last time I talked to him he basically recited the utilities' position on electric restructuring. Let's make it clear that we want to talk right after the election.

66

67 PubMed-Query Topics

68

69 PubMed-Query Author Model P. M. Lindeque, South Africa. TOPICS: Topic 1: water, natural, foci, environmental, source (prob=0.33); Topic 2: anthracis, anthrax, bacillus, spores, cereus (prob=0.13); Topic 3: species, sp, isolated, populations, tested (prob=0.06); Topic 4: epidemic, occurred, outbreak, persons (prob=0.06); Topic 5: positive, samples, negative, tested (prob=0.05). PAPERS: Vaccine-induced protections against anthrax in cheetah; Airborne movement of anthrax spores from carcass sites in the Etosha National Park; Ecology and epidemiology of anthrax in the Etosha National Park; Serology and anthrax in humans, livestock, and wildlife.

70 PubMed-Query: Topics by Country

71

72 3 of 300 example topics (TASA)

73 Word sense disambiguation (numbers and colors indicate topic assignments)

74 Finding unusual papers for an author Perplexity = exp[entropy(words | model)] = a measure of how surprised the model is by the data. We can calculate the perplexity of unseen documents, conditioned on the model for a particular author.
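A sketch of the per-author perplexity calculation: score a held-out document's words under the author's mixture of topics, then exponentiate the average negative log-likelihood. The θ/φ conventions match the earlier sketches; variable names are assumptions.

```python
import numpy as np

def perplexity_for_author(word_ids, author, theta, phi):
    """Perplexity of a document's words under one author's topic mixture.

    theta[a, t] = p(topic t | author a); phi[t, w] = p(word w | topic t).
    """
    p_w = theta[author] @ phi                 # p(word | author), shape (V,)
    log_lik = np.log(p_w[word_ids]).sum()     # log-likelihood of the document
    return np.exp(-log_lik / len(word_ids))   # exp of per-word negative log-likelihood
```

A paper whose perplexity is far above the author's median (as in the slides that follow) is flagged as unusual for that author.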

75 Papers and Perplexities: M_Jordan Factorial Hidden Markov Models: 687. Learning from Incomplete Data: 702.

76 Papers and Perplexities: M_Jordan Factorial Hidden Markov Models: 687. Learning from Incomplete Data: 702. MEDIAN PERPLEXITY: 2567.

77 Papers and Perplexities: M_Jordan Factorial Hidden Markov Models: 687. Learning from Incomplete Data: 702. MEDIAN PERPLEXITY: 2567. Defining and Handling Transient Fields in Pjama: 14555. An Orthogonally Persistent JAVA: 16021.

78 Papers and Perplexities: T_Mitchell Explanation-based Learning for Mobile Robot Perception: 1093. Learning to Extract Symbolic Knowledge from the Web: 1196.

79 Papers and Perplexities: T_Mitchell Explanation-based Learning for Mobile Robot Perception: 1093. Learning to Extract Symbolic Knowledge from the Web: 1196. MEDIAN PERPLEXITY: 2837.

80 Papers and Perplexities: T_Mitchell Explanation-based Learning for Mobile Robot Perception: 1093. Learning to Extract Symbolic Knowledge from the Web: 1196. MEDIAN PERPLEXITY: 2837. Text Classification from Labeled and Unlabeled Documents using EM: 3802. A Method for Estimating Occupational Radiation Dose…: 8814.

81 Author prediction with CiteSeer Task: predict the (single) author of new CiteSeer abstracts. Results: for 33% of documents, the author is guessed correctly; the median rank of the true author is 26 (out of 85,000).
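A sketch of how such a ranking can be computed: score every candidate author by the likelihood of the abstract's words under that author's topic mixture and record where the true author lands. It reuses the θ/φ convention from the earlier sketches; the single-author assumption matches the task on the slide.

```python
import numpy as np

def rank_of_true_author(word_ids, true_author, theta, phi):
    """Rank (1 = best) of the true author among all candidate authors.

    theta[a, t] = p(topic t | author a); phi[t, w] = p(word w | topic t).
    """
    p_w_given_a = theta @ phi                                    # (A, V): p(word | author)
    log_scores = np.log(p_w_given_a[:, word_ids]).sum(axis=1)    # one score per author
    order = np.argsort(-log_scores)                              # best-scoring author first
    return int(np.where(order == true_author)[0][0]) + 1
```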

82 Who wrote what? A method 1 is described which like the kernel 1 trick 1 in support 1 vector 1 machines 1 SVMs 1 lets us generalize distance 1 based 2 algorithms to operate in feature 1 spaces usually nonlinearly related to the input 1 space This is done by identifying a class of kernels 1 which can be represented as norm 1 based 2 distances 1 in Hilbert spaces It turns 1 out that common kernel 1 algorithms such as SVMs 1 and kernel 1 PCA 1 are actually really distance 1 based 2 algorithms and can be run 2 with that class of kernels 1 too As well as providing 1 a useful new insight 1 into how these algorithms work the present 2 work can form the basis 1 for conceiving new algorithms This paper presents 2 a comprehensive approach for model 2 based 2 diagnosis 2 which includes proposals for characterizing and computing 2 preferred 2 diagnoses 2 assuming that the system 2 description 2 is augmented with a system 2 structure 2 a directed 2 graph 2 explicating the interconnections between system 2 components 2 Specifically we first introduce the notion of a consequence 2 which is a syntactically 2 unconstrained propositional 2 sentence 2 that characterizes all consistency 2 based 2 diagnoses 2 and show 2 that standard 2 characterizations of diagnoses 2 such as minimal conflicts 1 correspond to syntactic 2 variations 1 on a consequence 2 Second we propose a new syntactic 2 variation on the consequence 2 known as negation 2 normal form NNF and discuss its merits compared to standard variations Third we introduce a basic algorithm 2 for computing consequences in NNF given a structured system 2 description We show that if the system 2 structure 2 does not contain cycles 2 then there is always a linear size 2 consequence 2 in NNF which can be computed in linear time 2 For arbitrary 1 system 2 structures 2 we show a precise connection between the complexity 2 of computing 2 consequences and the topology of the underlying system 2 structure 2 Finally we present 2 an algorithm 2 that enumerates 2 the preferred 2 diagnoses 2 characterized by a consequence 2 The algorithm 2 is shown 1 to take linear time 2 in the size 2 of the consequence 2 if the preference criterion 1 satisfies some general conditions Written by (1) Scholkopf_B Written by (2) Darwiche_A Test of model: 1) artificially combine abstracts from different authors 2) check whether assignment is to correct original author

83 The Author-Topic Browser Querying on the author Pazzani_M; querying on a topic relevant to the author; querying on a document written by the author.

84 Stability of Topics The content of topics is arbitrary across runs of the model (e.g., topic #1 is not the same across runs). However: the majority of topics are stable over processing time; the majority of topics can be aligned across runs; topics appear to represent genuine structure in the data.
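The alignment used on the next two slides can be sketched as follows: compute pairwise KL distances between the topic-word distributions from two samples (symmetrized here for simplicity) and greedily match each topic to its nearest counterpart. It assumes smoothed, strictly positive φ rows; the greedy matching is an illustrative choice, not necessarily the exact procedure used in the talk.

```python
import numpy as np

def sym_kl(p, q):
    # Symmetrized KL divergence; assumes strictly positive (smoothed) distributions.
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def align_topics(phi1, phi2):
    """Greedily match topics in phi1 (run 1) to their closest topics in phi2 (run 2).

    phi1, phi2: arrays of shape (T, V) whose rows are topic-word distributions.
    Returns a list of (topic_in_run1, matched_topic_in_run2, distance)."""
    T = phi1.shape[0]
    dist = np.array([[sym_kl(phi1[i], phi2[j]) for j in range(T)] for i in range(T)])
    pairs, used = [], set()
    for i in np.argsort(dist.min(axis=1)):                 # most confident matches first
        j = min((j for j in range(T) if j not in used), key=lambda j: dist[i, j])
        used.add(j)
        pairs.append((int(i), int(j), float(dist[i, j])))
    return pairs
```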

85 Comparing NIPS topics from the same Markov chain KL distance between topics at t1 = 1000 and re-ordered topics at t2 = 2000. BEST KL = 0.54; WORST KL = 4.78.

86 Comparing NIPS topics from two different Markov chains KL distance between topics from chain 1 and re-ordered topics from chain 2. BEST KL = 1.03.

87 Gibbs Sampler Stability (NIPS data)

88 New Applications / Future Work Reviewer recommendation: "Find reviewers for this set of grant proposals who are active in relevant topics and have no conflicts of interest." Change detection/monitoring: Which authors are on the leading edge of new topics? Characterize the "topic trajectory" of an author over time. Author identification: Who wrote this document? Incorporation of stylistic information (stylometry). Additions to the model: modeling citations; modeling topic persistence in a document; …

89 Summary Topic models are a versatile probabilistic model for text data. Author-topic models are a very useful generalization (equivalent to the topic model when each document has its own distinct author). Learning has linear time complexity; Gibbs sampling is practical on very large data sets. Experimental results: on multiple large, complex data sets, the resulting topic-word and author-topic models are quite interpretable, and results appear stable with respect to sampling. Numerous possible applications… The current model is quite simple; many extensions are possible.

90 Further Information www.datalab.uci.edu Steyvers et al., ACM SIGKDD 2004; Rosen-Zvi et al., UAI 2004. www.datalab.uci.edu/author-topic: JAVA demo of the online browser; additional tables and results.

91 BACKUP SLIDES

92 Author-Topics Model x z w a Author Topic Word θ Author-Topic distributions φ Topic-Word distributions D n

93 Topics Model: Topics, no Authors z w Topic Word θ Document-Topic distributions φ Topic-Word distributions D n

94 Author Model: Authors, no Topics a w a Author Word Author-Word distributions D n

95 Comparison Results Train the models on part of a new document and predict the remaining words. Without having seen any words from the new document, author-topic information helps in predicting words from that document. The topics model is more flexible in adapting to the new document after observing a number of words.

96 Latent Semantic Analysis (Landauer & Dumais, 1997) Words with similar co-occurrence patterns across documents end up with similar vector representations: word/document counts -> SVD -> high-dimensional space. [Example word/document count table for RIVER, STREAM, BANK, MONEY across Doc1, Doc2, Doc3, …]
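For contrast with the probabilistic topic model, LSA boils down to a truncated SVD of the word/document count matrix. A minimal sketch; the toy counts below are illustrative stand-ins for the garbled table above, not its exact values.

```python
import numpy as np

# Rows = words, columns = documents; counts are illustrative only.
words = ["river", "stream", "bank", "money"]
X = np.array([[4, 3, 0],
              [2, 1, 0],
              [5, 9, 1],
              [0, 6, 1]], dtype=float)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                  # keep k latent dimensions
word_vecs = U[:, :k] * S[:k]           # low-dimensional word representations

# Words with similar co-occurrence patterns end up with similar vectors.
for w, v in zip(words, word_vecs):
    print(w, np.round(v, 2))
```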

97 LSA: geometric; partially generative; dimensions are not interpretable; little flexibility to expand the model (e.g., syntax). Topics: probabilistic; fully generative; topic dimensions are often interpretable; modular language of Bayes nets / graphical models.

98 Modeling syntax and semantics (Steyvers, Griffiths, Blei, and Tenenbaum) Semantics: probabilistic topics (long-range, document-specific dependencies). Syntax: 3rd-order HMM (short-range dependencies, constant across all documents).

