
1 Models for Authors and Text Documents. Mark Steyvers, UCI. In collaboration with: Padhraic Smyth (UCI), Michal Rosen-Zvi (UCI), Thomas Griffiths (Stanford)

2 These viewgraphs were developed by Professor Mark Steyvers and are intended for review by ICS 278 students. If you wish to use them for any other purposes, please contact Professor Smyth (smyth@ics.uci.edu) or Professor Steyvers (msteyver@uci.edu).

3 Goal
- Automatically extract topical content of documents
- Learn the association of topics to the authors of documents
- Propose a new, efficient probabilistic topic model: the author-topic model
- Some queries the model should be able to answer:
  - What topics does author X work on?
  - Which authors work on topic X?
  - What are interesting temporal patterns in topics?

4 A topic is represented as a (multinomial) distribution over words

5 Documents as Topic Mixtures: a Geometric Interpretation. (Figure: the word-probability simplex for a 3-word vocabulary, with axes P(word1), P(word2), P(word3) and the constraint P(word1) + P(word2) + P(word3) = 1; topic 1, topic 2, and documents are points on the simplex, with each document a mixture of the topics.)
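A toy numerical illustration of this picture; the topic vectors and the 0.7 mixing weight below are made up for illustration, not taken from the slides.

```python
import numpy as np

# Two topics over a 3-word vocabulary: points on the probability simplex.
topic1 = np.array([0.7, 0.2, 0.1])   # P(word1), P(word2), P(word3) for topic 1
topic2 = np.array([0.1, 0.3, 0.6])   # ... for topic 2
weight = 0.7                          # document's weight on topic 1 (illustrative)

# A document's word distribution is a convex combination of its topics.
document = weight * topic1 + (1 - weight) * topic2
assert np.isclose(document.sum(), 1.0)   # still on the simplex
print(document)                           # [0.52 0.23 0.25]
```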

6 Previous topic-based models
- Hofmann (1999): Probabilistic Latent Semantic Indexing (pLSI)
  - EM implementation
  - Problem of overfitting
- Blei, Ng, & Jordan (2003): Latent Dirichlet Allocation (LDA)
  - Clarified the pLSI model
  - Variational EM
- Griffiths & Steyvers (PNAS 2004)
  - Same generative model as LDA
  - Gibbs sampling technique for inference
  - Computationally simple
  - Efficient (linear in the size of the data)
  - Can be applied to >100K documents

7 Approach with Author-Topic Models
- Combine author models with topic models
- Ignore style; focus on the content of the document
- Learn the topics that authors write about
- Learn two matrices: one relating authors to topics, and one relating topics to words

8 Assumptions of the Generative Model
- Each author is associated with a topics mixture
- Each document contains a mixture of topics
- With multiple authors, the document expresses a mixture of the co-authors' topic mixtures
- Each word in a text is generated from one topic and one author (potentially different for each word)

9 Generative Process
- Assume authors A1 and A2 collaborate and produce a paper
- A1 has multinomial topic distribution θ1
- A2 has multinomial topic distribution θ2
- For each word in the paper:
  1. Sample an author x (uniformly) from {A1, A2}
  2. Sample a topic z from θx
  3. Sample a word w from the multinomial topic distribution φz
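A minimal sketch of this generative process, assuming θ (authors × topics) and φ (topics × words) are given as row-stochastic NumPy arrays; the function and variable names are illustrative, not from the slides.

```python
import numpy as np

def generate_document(n_words, coauthors, theta, phi, seed=0):
    """Generate one document under the author-topic model.
    coauthors: list of author ids; theta: (A, T); phi: (T, V)."""
    rng = np.random.default_rng(seed)
    words, topics, authors = [], [], []
    for _ in range(n_words):
        x = rng.choice(coauthors)                 # 1. sample an author uniformly
        z = rng.choice(len(phi), p=theta[x])      # 2. sample a topic from theta_x
        w = rng.choice(phi.shape[1], p=phi[z])    # 3. sample a word from phi_z
        authors.append(x); topics.append(z); words.append(w)
    return words, topics, authors
```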

10 Graphical Model. For each word: 1. choose an author from the set of co-authors; 2. choose a topic from that author's row of the matrix of author-topic distributions; 3. choose a word from that topic's row of the matrix of topic-word distributions.

11 Model Estimation
- Estimate x and z by Gibbs sampling (assignments of each word to an author and a topic)
- Integrate out θ and φ
- Estimation is efficient: linear in data size
- Infer:
  - Author-topic distributions (θ)
  - Topic-word distributions (φ)
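The slide does not show how θ and φ are recovered from the samples; a sketch of the standard smoothed point estimates computed from the Gibbs count matrices (notation assumed here: C^WT word-topic counts, C^AT author-topic counts, V vocabulary size, T topics, and Dirichlet hyperparameters α and β) is:

```latex
\phi_{mj} \approx \frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta},
\qquad
\theta_{kj} \approx \frac{C^{AT}_{kj} + \alpha}{\sum_{j'} C^{AT}_{kj'} + T\alpha}
```

Here φ_mj is the probability of word m under topic j, and θ_kj is the probability of topic j for author k.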

12 Gibbs sampling in Author-Topic models
- Need full conditional distributions for the variables
- The probability of assigning the current word i to topic j and author k, given everything else, involves two count factors: the number of times word w has been assigned to topic j, and the number of times topic j has been assigned to author k
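The conditional itself appeared as an equation image on the original slide; a reconstruction of the standard author-topic full conditional (the form in Rosen-Zvi, Griffiths, Steyvers & Smyth, with all counts excluding the current token, and k ranging over the co-authors a_d of the document) is:

```latex
P(z_i = j,\, x_i = k \mid w_i = m,\, \mathbf{z}_{-i},\, \mathbf{x}_{-i},\, \mathbf{w}_{-i},\, \mathbf{a}_d)
\;\propto\;
\frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta}
\;\cdot\;
\frac{C^{AT}_{kj} + \alpha}{\sum_{j'} C^{AT}_{kj'} + T\alpha}
```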

13 Gibbs sampling procedure

14 Start with random assignments to topics/authors

15 Use all previous assignments, except for current word-token

16 Sample topic and author, and move to next word-token

19 Collect samples after >1000 iterations
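Putting slides 13-19 together, below is a minimal sketch of the collapsed Gibbs sampler. It assumes toy inputs (docs as lists of word ids, doc_authors as lists of author ids) and illustrative hyperparameters alpha and beta; none of these names or defaults come from the slides.

```python
import numpy as np

def gibbs_author_topic(docs, doc_authors, V, A, T, alpha=0.1, beta=0.01,
                       n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    CWT = np.zeros((V, T))          # word-topic counts
    CAT = np.zeros((A, T))          # author-topic counts
    z, x = [], []                   # per-token topic / author assignments

    # Slide 14: start with random assignments to topics/authors
    for d, words in enumerate(docs):
        zd = rng.integers(T, size=len(words))
        xd = rng.choice(doc_authors[d], size=len(words))
        for w, j, a in zip(words, zd, xd):
            CWT[w, j] += 1
            CAT[a, j] += 1
        z.append(zd); x.append(xd)

    for _ in range(n_iter):
        for d, words in enumerate(docs):
            authors_d = np.asarray(doc_authors[d])
            for i, w in enumerate(words):
                # Slide 15: remove the current word-token's assignment
                j, a = z[d][i], x[d][i]
                CWT[w, j] -= 1
                CAT[a, j] -= 1
                # Slide 16: sample a new (author, topic) pair from the
                # full conditional (see the equation after slide 12)
                p_w = (CWT[w] + beta) / (CWT.sum(axis=0) + V * beta)          # (T,)
                p_a = (CAT[authors_d] + alpha) / \
                      (CAT[authors_d].sum(axis=1, keepdims=True) + T * alpha)  # (n_authors, T)
                p = (p_a * p_w).ravel()
                p /= p.sum()
                idx = rng.choice(len(p), p=p)
                a_new, j_new = authors_d[idx // T], idx % T
                CWT[w, j_new] += 1
                CAT[a_new, j_new] += 1
                z[d][i], x[d][i] = j_new, a_new
    # Slide 19: in practice, collect several samples after burn-in (>1000 iterations)
    return CWT, CAT, z, x
```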

20 Data
- Corpora:
  - CiteSeer: 160K abstracts, 85K authors
  - NIPS: 1.7K papers, 2K authors
  - Enron: 115K emails, 5K authors (senders)
- Removed stop words; no stemming
- Word order is irrelevant; just use word counts
- Processing time:
  - NIPS: 2000 Gibbs iterations ≈ 12 hours on a PC workstation
  - CiteSeer: 700 Gibbs iterations ≈ 111 hours

21 Four example topics from CiteSeer (T=300)

23 Four more topics

24 Some topics relate to generic word usage

25 Some likely topics per author (CiteSeer)
- Author = Andrew McCallum, U Mass:
  - Topic 1: classification, training, generalization, decision, data,…
  - Topic 2: learning, machine, examples, reinforcement, inductive,…
  - Topic 3: retrieval, text, document, information, content,…
- Author = Hector Garcia-Molina, Stanford:
  - Topic 1: query, index, data, join, processing, aggregate,…
  - Topic 2: transaction, concurrency, copy, permission, distributed,…
  - Topic 3: source, separation, paper, heterogeneous, merging,…
- Author = Paul Cohen, USC/ISI:
  - Topic 1: agent, multi, coordination, autonomous, intelligent,…
  - Topic 2: planning, action, goal, world, execution, situation,…
  - Topic 3: human, interaction, people, cognitive, social, natural,…

26 Four example topics from NIPS (T=100)

27 ENRON Email: two example topics (T=100)

28 ENRON Email: two topics not about Enron

29 Stability of Topics
- The content of topics is arbitrary across runs of the model (e.g., topic #1 is not the same across runs)
- However:
  - The majority of topics are stable over processing time
  - The majority of topics can be aligned across runs
  - Topics represent genuine structure in the data

30 Comparing NIPS topics from the same Markov chain: KL distance between topics at t1 = 1000 and re-ordered topics at t2 = 2000 (best match KL = 0.54, worst KL = 4.78).

31 Comparing NIPS topics from two different Markov chains: KL distance between topics from chain 1 and re-ordered topics from chain 2 (best match KL = 1.03, worst KL = 9.49).
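A sketch of how such a re-ordering could be computed. The slides do not state the exact distance (symmetric vs. one-sided KL) or matching procedure, so the symmetrized KL and greedy matching below are assumptions.

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two word distributions."""
    p = p + eps; q = q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def align_topics(phi1, phi2):
    """phi1, phi2: (T, V) topic-word matrices from two runs.
    Returns a permutation of phi2's topics that greedily matches phi1."""
    T = phi1.shape[0]
    dist = np.array([[sym_kl(phi1[i], phi2[j]) for j in range(T)] for i in range(T)])
    perm = -np.ones(T, dtype=int)
    used = set()
    for i in np.argsort(dist.min(axis=1)):        # assign the easiest matches first
        j = min((j for j in range(T) if j not in used), key=lambda j: dist[i, j])
        perm[i] = j
        used.add(j)
    return perm, dist
```

Applying phi2[perm] then re-orders the second run's topics so that row i best matches topic i of the first run, which is the kind of re-ordering shown in the KL-distance matrices above.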

32 Detecting Papers on Unusual Topics for Authors
- We can calculate the perplexity (unusualness) of the words in a document given an author
- Papers ranked by perplexity for M. Jordan:
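A minimal sketch of one way such a score could be computed from the learned matrices; the slides do not show the exact formula, so this per-word perplexity under a single author's topic mixture is an assumption.

```python
import numpy as np

def perplexity_given_author(word_ids, author, theta, phi):
    """word_ids: token ids of the document; theta: (A, T); phi: (T, V).
    Returns exp of the negative mean log-probability of the words given the author."""
    p_w = theta[author] @ phi                     # p(w | author), marginalized over topics
    log_probs = np.log(p_w[word_ids] + 1e-12)
    return float(np.exp(-log_probs.mean()))
```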

33 Author Separation
Test of the model: can it attribute words to authors correctly within a document? 1) Artificially combine abstracts from different authors; 2) check whether each word is assigned to the correct original author. In the combined text below, the number following each word is the model's author assignment.
Written by (1) Scholkopf_B: A method 1 is described which like the kernel 1 trick 1 in support 1 vector 1 machines 1 SVMs 1 lets us generalize distance 1 based 2 algorithms to operate in feature 1 spaces usually nonlinearly related to the input 1 space This is done by identifying a class of kernels 1 which can be represented as norm 1 based 2 distances 1 in Hilbert spaces It turns 1 out that common kernel 1 algorithms such as SVMs 1 and kernel 1 PCA 1 are actually really distance 1 based 2 algorithms and can be run 2 with that class of kernels 1 too As well as providing 1 a useful new insight 1 into how these algorithms work the present 2 work can form the basis 1 for conceiving new algorithms
Written by (2) Darwiche_A: This paper presents 2 a comprehensive approach for model 2 based 2 diagnosis 2 which includes proposals for characterizing and computing 2 preferred 2 diagnoses 2 assuming that the system 2 description 2 is augmented with a system 2 structure 2 a directed 2 graph 2 explicating the interconnections between system 2 components 2 Specifically we first introduce the notion of a consequence 2 which is a syntactically 2 unconstrained propositional 2 sentence 2 that characterizes all consistency 2 based 2 diagnoses 2 and show 2 that standard 2 characterizations of diagnoses 2 such as minimal conflicts 1 correspond to syntactic 2 variations 1 on a consequence 2 Second we propose a new syntactic 2 variation on the consequence 2 known as negation 2 normal form NNF and discuss its merits compared to standard variations Third we introduce a basic algorithm 2 for computing consequences in NNF given a structured system 2 description We show that if the system 2 structure 2 does not contain cycles 2 then there is always a linear size 2 consequence 2 in NNF which can be computed in linear time 2 For arbitrary 1 system 2 structures 2 we show a precise connection between the complexity 2 of computing 2 consequences and the topology of the underlying system 2 structure 2 Finally we present 2 an algorithm 2 that enumerates 2 the preferred 2 diagnoses 2 characterized by a consequence 2 The algorithm 2 is shown 1 to take linear time 2 in the size 2 of the consequence 2 if the preference criterion 1 satisfies some general conditions

34 Temporal patterns in topics: hot and cold topics
- We have CiteSeer papers from 1986-2001
- We can calculate time-series for topics
- Hot topics become more prevalent; cold topics become less prevalent
- Do the time-series correspond with known trends in computer science?

36 Hot Topic: machine learning, data mining

37 The inevitability of Bayes…

38 Rise in Web/Mobile topics

39 (Not so) Hot Topics

40 Decline in programming languages, OS, ….

41 Security research reborn….

42 Decrease in use of Greek Letters

43 Burst of French writing in mid 90’s?

44 Comparison to models that use less information: the topics model (topics, no authors) and the author model (authors, no topics).

45 Matrix Factorization Interpretation
- Author-Topic model: [Documents × Words] = [Documents × Authors (A)] × [Authors × Topics] × [Topics × Words]
- Topic model: [Documents × Words] = [Documents × Topics] × [Topics × Words]
- Author model: [Documents × Words] = [Documents × Authors (A)] × [Authors × Words]
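Stated as per-document word distributions (notation assumed here: φ for topic-word distributions, θ for author-topic or document-topic weights, A_d the set of authors of document d; the uniform average over co-authors follows the generative assumption on slide 9):

```latex
\begin{align*}
\text{Author-Topic model:} \quad
  P(w \mid d) &= \frac{1}{|A_d|} \sum_{a \in A_d} \sum_{t} \phi_{wt}\, \theta_{at} \\
\text{Topic model:} \quad
  P(w \mid d) &= \sum_{t} P(w \mid t)\, P(t \mid d) \\
\text{Author model:} \quad
  P(w \mid d) &= \frac{1}{|A_d|} \sum_{a \in A_d} P(w \mid a)
\end{align*}
```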

46 Comparison Results
- Train models on part of a new document and predict the remaining words
- Without having seen any words from the new document, author-topic information helps in predicting words from that document
- The topics model is more flexible in adapting to the new document after observing a number of words

47 Author prediction with CiteSeer
- Task: predict the (single) author of new CiteSeer abstracts
- Results:
  - For 33% of documents, the author was guessed correctly
  - Median rank of the true author = 26 (out of 85,000)
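One plausible way to produce such a ranking from the learned matrices; the slide does not specify the scoring rule, so this sketch simply scores every candidate author by the log-likelihood of the abstract under that author's topic mixture.

```python
import numpy as np

def rank_authors(word_ids, theta, phi):
    """theta: (A, T) author-topic matrix; phi: (T, V) topic-word matrix.
    Returns author indices ordered from most to least likely for the document."""
    p_w = theta @ phi                                    # (A, V): p(w | author) for every author
    log_lik = np.log(p_w[:, word_ids] + 1e-12).sum(axis=1)   # log-likelihood per author
    return np.argsort(-log_lik)
```

The rank of the true author in the returned ordering corresponds to the "median rank of the true author" statistic quoted above.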

48 Perplexities for the true author and any random author (conditions compared: A = true author versus A = any author).

49 The Author-Topic Browser, with panels showing: querying on author Pazzani_M; querying on a topic relevant to the author; and querying on a document written by the author. http://www.ics.uci.edu/~michal/KDD/ATM.htm

50 New Applications / Future Work
- Finding relevant email:
  - "find emails similar to this email based on content"
  - "find people who wrote emails similar in content to this one"
- Reviewer recommendation:
  - "Find reviewers for this set of NSF proposals who are active in relevant topics and have no conflicts of interest"
- Change detection/monitoring:
  - Which authors are on the leading edge of new topics?
  - Characterize the "topic trajectory" of this author over time
- Author identification:
  - Who wrote this document? Incorporation of stylistic information

51 Comparing NIPS topics and CiteSeer topics: KL distance between NIPS topics and re-ordered CiteSeer topics (example matched pairs: KL = 2.88, 4.48, 4.92, 5.0).

