Models for Authors and Text Documents Mark Steyvers UCI In collaboration with: Padhraic Smyth (UCI) Michal Rosen-Zvi (UCI) Thomas Griffiths (Stanford)

These viewgraphs were developed by Professor Mark Steyvers and are intended for review by ICS 278 students. If you wish to use them for any other purpose, please contact Professor Smyth or Professor Steyvers.

Goal
- Automatically extract the topical content of documents
- Learn the association of topics to the authors of documents
- Propose a new, efficient probabilistic topic model: the author-topic model
- Some queries the model should be able to answer:
  - What topics does author X work on?
  - Which authors work on topic X?
  - What are interesting temporal patterns in topics?

A topic is represented as a (multinomial) distribution over words
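As a minimal sketch of this idea (the vocabulary and probabilities below are invented for illustration, not taken from the slides), a topic is simply a normalized vector of word probabilities from which words can be sampled:

```python
import numpy as np

# A hypothetical topic: P(word | topic) over a toy vocabulary
vocab = ["network", "neural", "learning", "model", "data"]
topic = np.array([0.35, 0.25, 0.20, 0.12, 0.08])
assert np.isclose(topic.sum(), 1.0)  # a multinomial distribution must sum to 1

# Generate words by sampling from the topic distribution
rng = np.random.default_rng(0)
words = rng.choice(vocab, size=10, p=topic)
```

High-probability words ("network", "neural") dominate the samples, which is why printing a topic's top words, as in the example slides below, summarizes it well.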

Documents as Topic Mixtures: a Geometric Interpretation
With a three-word vocabulary, P(word1) + P(word2) + P(word3) = 1, so every distribution over words is a point on a simplex. Topic 1 and topic 2 are points on this simplex, and a document (a mixture of the two topics) lies on the line segment between them.

Previous topic-based models
- Hofmann (1999): Probabilistic Latent Semantic Indexing (pLSI)
  - EM implementation
  - Problem of overfitting
- Blei, Ng, & Jordan (2003): Latent Dirichlet Allocation (LDA)
  - Clarified the pLSI model
  - Variational EM for inference
- Griffiths & Steyvers (PNAS, 2004)
  - Same generative model as LDA
  - Gibbs sampling technique for inference
  - Computationally simple
  - Efficient (linear in the size of the data)
  - Can be applied to >100K documents

Approach with Author-Topic Models
- Combine author models with topic models
- Ignore style; focus on the content of the document
- Learn the topics that authors write about
- Learn two matrices: Authors × Topics and Topics × Words

Assumptions of the Generative Model
- Each author is associated with a topic mixture
- Each document contains a mixture of topics
- With multiple authors, the document expresses a mixture of the co-authors' topic mixtures
- Each word in a text is generated from one topic and one author (potentially different for each word)

Generative Process
- Assume authors A1 and A2 collaborate and produce a paper
  - A1 has multinomial topic distribution θ1
  - A2 has multinomial topic distribution θ2
- For each word in the paper:
  1. Sample an author x (uniformly) from {A1, A2}
  2. Sample a topic z from θx
  3. Sample a word w from the multinomial topic distribution φz
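The three steps above can be sketched directly in code; the toy sizes and the θ values are assumptions chosen for illustration, and φ is drawn from a symmetric Dirichlet:

```python
import numpy as np

rng = np.random.default_rng(0)
T, V = 3, 6                                # number of topics, vocabulary size (toy)
theta = {                                  # per-author multinomial topic distributions
    "A1": np.array([0.7, 0.2, 0.1]),
    "A2": np.array([0.1, 0.3, 0.6]),
}
phi = rng.dirichlet(np.ones(V), size=T)    # topic-word distributions, one row per topic

def generate_word(coauthors):
    """Generate one word of the paper: pick an author, then a topic, then a word."""
    x = rng.choice(coauthors)              # 1. sample an author uniformly
    z = rng.choice(T, p=theta[x])          # 2. sample a topic z from theta_x
    w = rng.choice(V, p=phi[z])            # 3. sample a word w from phi_z
    return x, z, w

paper = [generate_word(["A1", "A2"]) for _ in range(20)]
```

Each word token carries its own latent author and topic, which is exactly what the Gibbs sampler below has to recover.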

Graphical Model
1. Choose an author from the set of co-authors
2. Choose a topic
3. Choose a word
Θ: matrix of author-topic distributions; Φ: matrix of topic-word distributions

Model Estimation
- Estimate x and z by Gibbs sampling (assignments of each word to an author and a topic)
- Integrate out θ and φ (collapsed Gibbs sampling)
- Estimation is efficient: linear in data size
- Infer:
  - author-topic distributions (θ)
  - topic-word distributions (φ)

Gibbs Sampling in Author-Topic Models
- Need the full conditional distributions for the hidden variables
- The probability of assigning the current word i (with word type w) to topic j and author k, given all other assignments, is

  P(z_i = j, x_i = k | w_i = w, z_-i, x_-i) ∝ (C^WT_wj + β) / (Σ_w' C^WT_w'j + Vβ) × (C^AT_kj + α) / (Σ_j' C^AT_kj' + Tα)

  where C^WT_wj is the number of times word w is assigned to topic j, and C^AT_kj is the number of times topic j is assigned to author k.

Gibbs sampling procedure
- Start with random assignments to topics/authors
- Use all previous assignments, except for the current word token
- Sample a topic and author, and move to the next word token
- Collect samples after >1000 iterations
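Putting the procedure together, a compact (unoptimized) collapsed Gibbs sampler for the author-topic model might look like the sketch below. The hyperparameters alpha and beta and the default iteration count are assumptions for illustration, not values from the slides:

```python
import numpy as np

def author_topic_gibbs(docs, doc_authors, V, A, T, n_iter=200,
                       alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for the author-topic model (illustrative sketch).

    docs        : list of documents, each a list of word ids in [0, V)
    doc_authors : list of author-id lists, one per document
    Returns theta (A x T author-topic) and phi (T x V topic-word) estimates.
    """
    rng = np.random.default_rng(seed)
    WT = np.zeros((V, T))                  # word-topic counts
    AT = np.zeros((A, T))                  # author-topic counts
    z_assign = [[0] * len(d) for d in docs]
    x_assign = [[0] * len(d) for d in docs]

    # Start with random assignments to topics/authors
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            z = rng.integers(T)
            x = rng.choice(doc_authors[d])
            WT[w, z] += 1
            AT[x, z] += 1
            z_assign[d][i], x_assign[d][i] = z, x

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            authors = doc_authors[d]
            for i, w in enumerate(doc):
                # Use all previous assignments except the current word token
                z, x = z_assign[d][i], x_assign[d][i]
                WT[w, z] -= 1
                AT[x, z] -= 1
                # Full conditional over (author, topic) pairs
                p_wz = (WT[w] + beta) / (WT.sum(axis=0) + V * beta)            # (T,)
                p_az = (AT[authors] + alpha) / \
                       (AT[authors].sum(axis=1, keepdims=True) + T * alpha)    # (|authors|, T)
                p = (p_az * p_wz).ravel()
                idx = rng.choice(p.size, p=p / p.sum())
                x, z = authors[idx // T], idx % T
                WT[w, z] += 1
                AT[x, z] += 1
                z_assign[d][i], x_assign[d][i] = z, x

    # Point estimates of the two matrices, with theta and phi integrated back in
    theta = (AT + alpha) / (AT + alpha).sum(axis=1, keepdims=True)
    phi = ((WT + beta) / (WT.sum(axis=0) + V * beta)).T
    return theta, phi
```

Each sweep touches every word token once, which is why the cost is linear in the size of the corpus, as the Model Estimation slide claims.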

Data
- Corpora:
  - CiteSeer: 160K abstracts, 85K authors
  - NIPS: 1.7K papers, 2K authors
  - Enron: 115K emails, 5K authors (senders)
- Removed stop words; no stemming
- Word order is ignored; only word counts are used
- Processing time:
  - NIPS: 2000 Gibbs iterations → 12 hours on a PC workstation
  - CiteSeer: 700 Gibbs iterations → 111 hours

Four example topics from CiteSeer (T=300)

Four more topics

Some topics relate to generic word usage

Some likely topics per author (CiteSeer)
- Author = Andrew McCallum, UMass:
  - Topic 1: classification, training, generalization, decision, data, ...
  - Topic 2: learning, machine, examples, reinforcement, inductive, ...
  - Topic 3: retrieval, text, document, information, content, ...
- Author = Hector Garcia-Molina, Stanford:
  - Topic 1: query, index, data, join, processing, aggregate, ...
  - Topic 2: transaction, concurrency, copy, permission, distributed, ...
  - Topic 3: source, separation, paper, heterogeneous, merging, ...
- Author = Paul Cohen, USC/ISI:
  - Topic 1: agent, multi, coordination, autonomous, intelligent, ...
  - Topic 2: planning, action, goal, world, execution, situation, ...
  - Topic 3: human, interaction, people, cognitive, social, natural, ...

Four example topics from NIPS (T=100)

ENRON two example topics (T=100)

ENRON two topics not about Enron

Stability of Topics
- The indexing of topics is arbitrary across runs of the model (e.g., topic #1 is not the same topic across runs)
- However:
  - The majority of topics are stable over processing time
  - The majority of topics can be aligned across runs
  - Topics represent genuine structure in the data

Comparing NIPS topics from the same Markov chain: KL distance between topics at t1 = 1000 and re-ordered topics at t2 = 2000. Best match: KL = 0.54; worst match: KL = 4.78.

Comparing NIPS topics from two different Markov chains: KL distance between the topics of chain 1 and the re-ordered topics of chain 2. Best match: KL = 1.03; worst match: KL = 9.49.
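A simple way to produce comparisons like these is to compute a pairwise KL distance matrix between the topic-word distributions of two runs and greedily match each topic to its closest unused counterpart. The sketch below uses a symmetrised KL distance (an assumption; the slides do not say which variant was used):

```python
import numpy as np

def align_topics(phi_a, phi_b, eps=1e-12):
    """Align two (T, V) topic-word matrices from different runs by KL distance.

    Returns the (T, T) distance matrix and a greedy matching of each topic
    in phi_a to its closest unused topic in phi_b.
    """
    pa, pb = phi_a + eps, phi_b + eps      # avoid log(0)
    T = pa.shape[0]
    D = np.zeros((T, T))
    for i in range(T):
        for j in range(T):
            kl_ab = np.sum(pa[i] * np.log(pa[i] / pb[j]))
            kl_ba = np.sum(pb[j] * np.log(pb[j] / pa[i]))
            D[i, j] = 0.5 * (kl_ab + kl_ba)  # symmetrised KL
    match, used = {}, set()
    for i in np.argsort(D.min(axis=1)):      # best-matched topics claim partners first
        j = min((j for j in range(T) if j not in used), key=lambda j: D[i, j])
        match[i] = j
        used.add(j)
    return D, match
```

The re-ordered matrices in the figures correspond to permuting the columns of one run by this matching; a dark diagonal then indicates stable topics.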

Detecting Papers on Unusual Topics for Authors
- We can calculate the perplexity (unusualness) of the words in a document given an author
- Papers ranked by perplexity for M. Jordan:
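Such a ranking can be computed from the learned matrices; in this sketch theta is the author-topic matrix and phi the topic-word matrix, and the higher the perplexity, the more unusual the document is for that author:

```python
import numpy as np

def author_perplexity(words, author_id, theta, phi):
    """Perplexity of a document's words under one author's topic mixture.

    words : list of word ids; theta : (A, T); phi : (T, V).
    """
    p_w = theta[author_id] @ phi            # P(word | author), shape (V,)
    log_lik = np.log(p_w[words]).sum()      # log-likelihood of the document
    return np.exp(-log_lik / len(words))    # perplexity = exp(-avg log-likelihood)
```

As a sanity check, a completely uninformative model over a vocabulary of size V gives perplexity exactly V; papers an author is unlikely to have written score much higher than their typical papers.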

Author Separation Can the model attribute words to authors correctly within a document? Test of the model: 1) artificially combine abstracts from different authors; 2) check whether each word is assigned to its correct original author. Below, each word is tagged with the author the model assigned it to (1 = Scholkopf_B, 2 = Darwiche_A):
A method 1 is described which like the kernel 1 trick 1 in support 1 vector 1 machines 1 SVMs 1 lets us generalize distance 1 based 2 algorithms to operate in feature 1 spaces usually nonlinearly related to the input 1 space This is done by identifying a class of kernels 1 which can be represented as norm 1 based 2 distances 1 in Hilbert spaces It turns 1 out that common kernel 1 algorithms such as SVMs 1 and kernel 1 PCA 1 are actually really distance 1 based 2 algorithms and can be run 2 with that class of kernels 1 too As well as providing 1 a useful new insight 1 into how these algorithms work the present 2 work can form the basis 1 for conceiving new algorithms — Written by (1) Scholkopf_B
This paper presents 2 a comprehensive approach for model 2 based 2 diagnosis 2 which includes proposals for characterizing and computing 2 preferred 2 diagnoses 2 assuming that the system 2 description 2 is augmented with a system 2 structure 2 a directed 2 graph 2 explicating the interconnections between system 2 components 2 Specifically we first introduce the notion of a consequence 2 which is a syntactically 2 unconstrained propositional 2 sentence 2 that characterizes all consistency 2 based 2 diagnoses 2 and show 2 that standard 2 characterizations of diagnoses 2 such as minimal conflicts 1 correspond to syntactic 2 variations 1 on a consequence 2 Second we propose a new syntactic 2 variation on the consequence 2 known as negation 2 normal form NNF and discuss its merits compared to standard variations Third we introduce a basic algorithm 2 for computing consequences in NNF given a structured system 2 description We show that if the system 2 structure 2 does not contain cycles 2 then there is always a linear size 2 consequence 2 in NNF which can be computed in linear time 2 For arbitrary 1 system 2 structures 2 we show a precise connection between the complexity 2 of computing 2 consequences and the topology of the underlying system 2 structure 2 Finally we present 2 an algorithm 2 that enumerates 2 the preferred 2 diagnoses 2 characterized by a consequence 2 The algorithm 2 is shown 1 to take linear time 2 in the size 2 of the consequence 2 if the preference criterion 1 satisfies some general conditions — Written by (2) Darwiche_A

Temporal Patterns in Topics: Hot and Cold Topics
- We have CiteSeer papers from a range of publication years
- We can calculate a time series for each topic
- Hot topics become more prevalent over time
- Cold topics become less prevalent
- Do the time series correspond to known trends in computer science?

Hot Topic: machine learning, data mining

The inevitability of Bayes…

Rise in Web/Mobile topics

(Not so) Hot Topics

Decline in programming languages, OS, ….

Security research reborn….

Decrease in use of Greek letters

Burst of French writing in mid 90’s?

Comparison to models that use less information: the topics model (topics, no authors) and the author model (authors, no topics).

Matrix Factorization Interpretation
- AUTHOR-TOPIC MODEL: (Words × Documents) ≈ (Words × Topics) × (Topics × Authors) × (Authors × Documents)
- TOPIC MODEL: (Words × Documents) ≈ (Words × Topics) × (Topics × Documents)
- AUTHOR MODEL: (Words × Documents) ≈ (Words × Authors) × (Authors × Documents)
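The author-topic factorization can be checked numerically: a product of row-stochastic matrices is row-stochastic, so the reconstructed document-word matrix is automatically a set of valid word distributions. The toy shapes below are assumptions for illustration:

```python
import numpy as np

# Toy sizes: documents, authors, topics, vocabulary (assumed for illustration)
D, A, T, V = 4, 3, 2, 5
rng = np.random.default_rng(0)

def rows_to_dist(M):
    """Normalize each row into a probability distribution."""
    return M / M.sum(axis=1, keepdims=True)

doc_author = rows_to_dist(rng.random((D, A)))  # Documents x Authors mixing
theta = rows_to_dist(rng.random((A, T)))       # Authors x Topics
phi = rows_to_dist(rng.random((T, V)))         # Topics x Words

# Author-topic model: P(word | document) factors through authors and topics
P_dw = doc_author @ theta @ phi                # Documents x Words
```

Dropping the author factor recovers the plain topic model; dropping the topic factor recovers the author model, matching the three factorizations above.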

Comparison Results
- Train models on part of a new document and predict the remaining words
- Without having seen any words from the new document, author-topic information helps in predicting words from that document
- The topics model is more flexible in adapting to the new document after observing a number of words

Author prediction with CiteSeer
- Task: predict the (single) author of new CiteSeer abstracts
- Results:
  - For 33% of documents, the author is guessed correctly
  - Median rank of the true author = 26 (out of 85,000)
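Prediction of this kind can be sketched by scoring every candidate author's likelihood for the new abstract and sorting; the function name and shapes below are illustrative assumptions:

```python
import numpy as np

def rank_authors(words, theta, phi):
    """Rank all candidate authors for a document by log-likelihood.

    words : list of word ids; theta : (A, T); phi : (T, V).
    Returns author ids, best first; the rank of the true author
    can then be read off this ordering.
    """
    p_w = theta @ phi                       # (A, V): P(word | author)
    scores = np.log(p_w[:, words]).sum(axis=1)
    return np.argsort(-scores)              # highest log-likelihood first
```

The "median rank = 26" result corresponds to where the true author typically lands in this ordering over all 85,000 candidates.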

Perplexities for the true author versus any random author (curves: A = true author, A = any author).

The Author-Topic Browser: (a) querying on author Pazzani_M; (b) querying on a topic relevant to that author; (c) querying on a document written by that author.

New Applications / Future Work
- Finding relevant email
  - "Find emails similar to this one based on content"
  - "Find people who wrote emails similar in content to this one"
- Reviewer recommendation
  - "Find reviewers for this set of NSF proposals who are active in relevant topics and have no conflicts of interest"
- Change detection/monitoring
  - Which authors are on the leading edge of new topics?
  - Characterize the "topic trajectory" of an author over time
- Author identification
  - Who wrote this document? Incorporation of stylistic information

Comparing NIPS topics and CiteSeer topics: KL distance between NIPS topics and re-ordered CiteSeer topics. Example matches: KL = 2.88, 4.48, 4.92, 5.0.