Automatic Labeling of Multinomial Topic Models

Presentation transcript:

Automatic Labeling of Multinomial Topic Models Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai University of Illinois at Urbana-Champaign

Outline
Background: statistical topic models
Labeling a topic model
  Criteria and challenge
  Our approach: a probabilistic framework
Experiments
Summary

Statistical Topic Models for Text Mining
Text collections -> probabilistic topic modeling -> topic models (multinomial distributions)
Example topics:
  term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independ. 0.03, model 0.03, …
  web 0.21, search 0.10, link 0.08, graph 0.05, …
Models: PLSA [Hofmann 99]; LDA [Blei et al. 03]; Author-Topic [Steyvers et al. 04]; CPLSA [Mei & Zhai 06]; Pachinko allocation [Li & McCallum 06]; Topic over time [Wang et al. 06]; …
Applications: subtopic discovery; opinion comparison; summarization; topical pattern analysis; …

Topic Models: Hard to Interpret
Use top words: automatic, but hard to make sense of
Human-generated labels: make sense, but cannot scale up
Example topic: term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, …
  Top words as a label: “Term, relevance, weight, feedback”?
  Human label: Retrieval Models
Example topic: insulin, foraging, foragers, collected, grains, loads, collection, nectar, …
Question: Can we automatically generate understandable labels for topics?

What is a Good Label?
Criteria:
  Semantically close (relevance)
  Understandable (phrases?)
  High coverage inside topic
  Discriminative across topics
Example topic (Mei and Zhai 06: a topic in SIGIR): term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …
Candidate labels: iPod Nano; じょうほうけんさく (Japanese for “information retrieval”); Pseudo-feedback; Retrieval models; Information Retrieval

Our Method
Input: a collection (e.g., SIGIR) and a topic: term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …
Step 1, candidate label pool (NLP chunker, n-gram statistics over the collection): information retrieval, retrieval model, index structure, relevance feedback, …
Step 2, relevance score: information retrieval 0.26; retrieval models 0.19; IR models 0.17; pseudo feedback 0.06; …
Step 3, discrimination (against other topics such as “filtering 0.21, collaborative 0.15, …” and “trec 0.18, evaluation 0.10, …”): information retrieval 0.26 -> 0.01; retrieval models 0.20; IR models 0.18; pseudo feedback 0.09; …
Step 4, coverage: retrieval models 0.20; IR models 0.18 -> 0.02; pseudo feedback 0.09; …; information retrieval 0.01
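
Step 1 can be illustrated with a small sketch of bigram candidate generation. The slide only says "Ngram Stat."; ranking by pointwise mutual information (PMI), the `min_count` threshold, the function name `candidate_bigrams`, and the toy documents are all assumptions made for illustration, not the authors' procedure.

```python
from collections import Counter
import math

def candidate_bigrams(docs, min_count=2, top_k=5):
    """Rank adjacent word pairs by PMI (an assumed significance measure)."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # drop rare pairs, whose PMI estimates are unreliable
        p_pair = count / n_bi
        p_indep = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scored.append((math.log(p_pair / p_indep), f"{w1} {w2}"))
    scored.sort(reverse=True)
    return [phrase for _, phrase in scored[:top_k]]

# Hypothetical toy collection
docs = [
    "relevance feedback improves retrieval",
    "relevance feedback for retrieval models",
    "retrieval models and index structure",
    "index structure for retrieval models",
]
pool = candidate_bigrams(docs)
```

On this toy input the pool also contains "for retrieval", which suggests why a real pipeline would likely add stop-word filtering or use a chunker.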

Relevance (Task 2): the Zero-Order Score
Intuition: prefer phrases that cover the top words of the topic well
Latent topic θ, with word distribution p(w|θ): clustering, dimensional, algorithm, birch, …, shape, …, body, …
  p(“clustering”|θ) = 0.4
  p(“dimensional”|θ) = 0.3
  p(“shape”|θ) = 0.01
  p(“body”|θ) = 0.001
Good label (l1): “clustering algorithm” √
Bad label (l2): “body shape”
l1 > l2
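
A minimal sketch of the zero-order intuition, assuming a label is scored by summing log p(w|θ) over its words. The slide does not give the exact formula, so this scoring function, the `floor` value for unseen words, and the value 0.1 for "algorithm" are assumptions; the other probabilities are those quoted on the slide.

```python
import math

# p(w | theta) as quoted on the slide; "algorithm" is not given there,
# so 0.1 is a hypothetical placeholder.
topic = {
    "clustering": 0.4,
    "dimensional": 0.3,
    "algorithm": 0.1,
    "shape": 0.01,
    "body": 0.001,
}

def zero_order_score(label, topic, floor=1e-6):
    """Sum of log p(w | theta) over the words of the label (assumed form)."""
    return sum(math.log(topic.get(word, floor)) for word in label.split())

good = zero_order_score("clustering algorithm", topic)
bad = zero_order_score("body shape", topic)
```

Under these assumptions "clustering algorithm" outscores "body shape" because both of its words carry far more probability mass in the topic.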

Relevance (Task 2): the First-Order Score
Intuition: prefer phrases with a similar context (distribution)
Topic θ, p(w|θ): clustering, dimension, partition, algorithm, hash, …
Good label (l1): “clustering algorithm”, with context distribution p(w | clustering algorithm) over clustering, hash, dimension, algorithm, partition, …
l2: “hash join”, with context distribution p(w | hash join) over clustering, hash, dimension, key, algorithm, …
Sample contexts of “hash join”: key … hash join … code … hash table … search … hash join … map key … hash … algorithm … key … hash … key table … join …
Score(l, θ) = -D(θ || l), so a label whose context distribution diverges less from the topic scores higher
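
The first-order idea can be sketched as a negative KL divergence between the topic distribution and each label's context distribution. All probabilities below are hypothetical, and the `eps` smoothing for words missing from a context is an assumption; only the word lists follow the slide.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """D(p || q) over the words of p; eps guards against zeros in q."""
    return sum(
        pw * math.log(pw / max(q.get(w, 0.0), eps))
        for w, pw in p.items()
        if pw > 0
    )

def first_order_score(topic, label_context):
    """Higher is better: the negative divergence of the label context from the topic."""
    return -kl_divergence(topic, label_context)

# Hypothetical distributions, for illustration only
topic = {"clustering": 0.4, "dimension": 0.2, "partition": 0.2,
         "algorithm": 0.15, "hash": 0.05}
contexts = {
    "clustering algorithm": {"clustering": 0.35, "dimension": 0.2,
                             "partition": 0.2, "algorithm": 0.2, "hash": 0.05},
    "hash join": {"clustering": 0.05, "dimension": 0.05, "key": 0.3,
                  "algorithm": 0.1, "hash": 0.5},
}
scores = {label: first_order_score(topic, dist) for label, dist in contexts.items()}
best = max(scores, key=scores.get)
```

Here "hash join" is penalized heavily because its context puts little or no mass on words such as "partition" that the topic considers likely.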

Discrimination and Coverage (Tasks 3 & 4)
Discriminative across topics: high relevance to the target topic, low relevance to other topics
High coverage inside topic: use the MMR (maximal marginal relevance) strategy
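
One possible reading of these two tasks, sketched under stated assumptions: the discriminative score subtracts a label's mean relevance to the other topics (weighted by a hypothetical parameter `mu`), and the MMR step greedily picks labels while penalizing word overlap with labels already chosen. The word-overlap similarity, the parameter values, and the relevance table are illustrative, not taken from the slides.

```python
def discriminative_score(label, target, relevance, mu=1.0):
    """Relevance to the target topic minus mu times mean relevance to other topics."""
    others = [t for t in relevance if t != target]
    penalty = sum(relevance[t][label] for t in others) / len(others) if others else 0.0
    return relevance[target][label] - mu * penalty

def word_overlap(a, b):
    """Jaccard similarity between the word sets of two labels (assumed similarity)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def mmr_select(scores, k=2, lam=0.5):
    """Greedy MMR: trade off a label's score against redundancy with chosen labels."""
    selected, remaining = [], dict(scores)
    while remaining and len(selected) < k:
        def mmr(label):
            redundancy = max((word_overlap(label, s) for s in selected), default=0.0)
            return lam * remaining[label] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        del remaining[best]
    return selected

# Hypothetical relevance scores of four labels against three topics
relevance = {
    "retrieval": {"information retrieval": 0.26, "retrieval models": 0.20,
                  "IR models": 0.18, "pseudo feedback": 0.09},
    "filtering": {"information retrieval": 0.25, "retrieval models": 0.02,
                  "IR models": 0.02, "pseudo feedback": 0.01},
    "evaluation": {"information retrieval": 0.25, "retrieval models": 0.01,
                   "IR models": 0.01, "pseudo feedback": 0.01},
}
labels = relevance["retrieval"]
disc = {l: discriminative_score(l, "retrieval", relevance) for l in labels}
chosen = mmr_select(disc, k=2)
```

With these numbers "information retrieval" loses almost all of its score because it is relevant to every topic, and "IR models" is skipped in favor of "pseudo feedback" because it overlaps with the already selected "retrieval models".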

Variations and Applications
Labeling document clusters
  Document cluster -> unigram language model
  Applicable to any task with a unigram language model
Context-sensitive labels
  The label of a topic is sensitive to the context
  An alternative way to approach contextual text mining
  Example: tree, prune, root, branch -> “tree algorithms” in CS; -> ? in horticulture; -> ? in marketing?
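
The "document cluster -> unigram language model" step can be sketched with a maximum-likelihood estimate over the pooled words of the cluster. The function name, the whitespace tokenization, and the toy cluster are assumptions; the slide does not say how the model is estimated.

```python
from collections import Counter

def unigram_language_model(cluster_docs):
    """Maximum-likelihood unigram model over all words in a document cluster."""
    counts = Counter(w for doc in cluster_docs for w in doc.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical cluster of two short documents
cluster = ["prune the tree at the root", "tree branch and root"]
model = unigram_language_model(cluster)
```

The resulting dictionary has the same shape as a topic's word distribution, so any label-scoring function that accepts p(w|θ) could be applied to it unchanged.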

Experiments
Datasets: SIGMOD abstracts; SIGIR abstracts; AP news data
Candidate labels: significant bigrams; NLP chunks
Topic models: PLSA, LDA
Evaluation: human annotators compare labels generated by anonymized systems; the order of systems is randomly perturbed; scores are averaged over all sample topics

Result Summary
Automatic phrase labels >> top words
First-order relevance >> zero-order relevance
Bigrams > NLP chunks
  Bigrams work better with literature; NLP chunks work better with news
System labels << human labels
  Scientific literature is an easier task

Results: Sample Topic Labels
Topic: north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007, …
  Labels: iran contra, …
Topic: the, of, a, and, to, data, > 0.02, …, clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005, …
  Labels: clustering algorithm, clustering structure, …
  Other labels shown: large data, data quality, high data, data application, …
Topic: tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01, …
  Labels: r tree, b tree, …; indexing methods

Results: Context-Sensitive Labeling
Topic: sampling, estimation, approximation, histogram, selectivity, histograms, …
Context: Database (SIGMOD Proceedings): selectivity estimation; random sampling; approximate answers
Context: IR (SIGIR Proceedings): distributed retrieval; parameter estimation; mixture models
Explore the different meanings of a topic in different contexts (content switch)
An alternative approach to contextual text mining

Summary
Labeling: a postprocessing step for all multinomial topic models
A probabilistic approach to generating good labels: understandable, relevant, high coverage, discriminative
Broadly applicable to mining tasks involving multinomial word distributions; context-sensitive
Future work:
  Labeling hierarchical topic models
  Incorporating priors

Thanks!