Automatic Labeling of Multinomial Topic Models Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai University of Illinois at Urbana-Champaign.

Presentation transcript:

Automatic Labeling of Multinomial Topic Models Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai University of Illinois at Urbana-Champaign

2 Outline
Background: statistical topic models
Labeling a topic model
– Criteria and challenge
Our approach: a probabilistic framework
Experiments
Summary

3 Statistical Topic Models for Text Mining
Text Collections → Probabilistic Topic Modeling → Topic models (multinomial distributions)
Example topics:
– web 0.21, search 0.10, link 0.08, graph 0.05, …
– term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independ…, model 0.03, …
Models:
– PLSA [Hofmann 99]
– LDA [Blei et al. 03]
– Author-Topic [Steyvers et al. 04]
– CPLSA [Mei & Zhai 06]
– Pachinko allocation [Li & McCallum 06]
– Topic over time [Wang et al. 06]
– …
Uses: subtopic discovery, opinion comparison, summarization, topical pattern analysis, …

4 Topic Models: Hard to Interpret
Use top words
– Automatic, but hard to make sense of
Human-generated labels
– Make sense, but cannot scale up
Example topic: term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, …
– Top words: term, relevance, weight, feedback
– Human label: Retrieval Models
Example topic: insulin, foraging, foragers, collected, grains, loads, collection, nectar, … → ?
Question: Can we automatically generate understandable labels for topics?

5 What is a Good Label?
Semantically close (relevance)
Understandable – phrases?
High coverage inside topic
Discriminative across topics
…
Topic (Mei and Zhai 06: a topic in SIGIR): term, relevance, weight, feedback, independence, model, frequent, probabilistic, document, …
Candidate labels: iPod Nano; Pseudo-feedback; Information Retrieval; Retrieval models; じょうほうけんさく (Japanese for "information retrieval")

6 Our Method
Input: a collection (e.g., SIGIR) and its topic models
– term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …
– filtering 0.21, collaborative 0.15, …
– trec 0.18, evaluation 0.10, …
Step 1: Candidate label pool (NLP chunker, n-gram statistics)
– information retrieval, retrieval model, index structure, relevance feedback, …
Step 2: Relevance score
– information retrieval 0.26, retrieval models 0.19, IR models 0.17, pseudo feedback 0.06, …
Step 3: Discrimination
– information retriev…, retrieval models 0.20, IR models 0.18, pseudo feedback 0.09, …
Step 4: Coverage
– retrieval models 0.20, IR models, pseudo feedback 0.09, …, information retrieval 0.01
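
A minimal sketch of Step 1 (building the candidate label pool from n-gram statistics). The slide only says "Ngram Stat."; the pointwise mutual information (PMI) ranking, the function name candidate_bigrams, the min_count threshold, and the toy documents below are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter

def candidate_bigrams(docs, min_count=2, top_k=5):
    """Rank adjacent word pairs by PMI (a stand-in significance measure)."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # drop rare pairs, whose PMI estimates are unreliable
        p_pair = count / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scored.append((math.log(p_pair / p_independent), f"{w1} {w2}"))
    scored.sort(reverse=True)
    return [phrase for _, phrase in scored[:top_k]]

# Hypothetical toy collection, not taken from the slides.
docs = [
    "relevance feedback improves the retrieval model",
    "a retrieval model with relevance feedback",
    "the index structure supports the retrieval model",
    "an index structure for text",
]
pool = candidate_bigrams(docs)
print(pool)
```

Pairs that merely contain a frequent function word (here "the retrieval") survive the count threshold but rank last, which is why some significance measure is needed on top of raw counts.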

7 Relevance (Task 2): the Zero-Order Score
Intuition: prefer phrases that cover the top words well
Latent topic θ, with p(w|θ): clustering, dimensional, algorithm, birch, shape, …, body
– p("clustering"|θ) = 0.4
– p("dimensional"|θ) = 0.3
– …
– p("body"|θ) = p("shape"|θ) = 0.01
Good label (l1): "clustering algorithm" (√)
Bad label (l2): "body shape" (?)
l1 > l2
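
The zero-order intuition can be sketched as follows. This is a simplified illustration that scores a phrase by the summed log-probability of its words under the topic; the function name zero_order_score, the floor value for unseen words, and the probability assigned to "algorithm" (which the slide does not give) are assumptions.

```python
import math

def zero_order_score(label, topic, floor=1e-6):
    """Sum of log p(w | topic) over the words of the label.

    Words missing from the topic get a small floor probability,
    so labels built from off-topic words score very low.
    """
    return sum(math.log(topic.get(w, floor)) for w in label.split())

# p(w | theta): three values come from the slide; "algorithm" is assumed.
topic = {
    "clustering": 0.4,
    "dimensional": 0.3,
    "algorithm": 0.2,   # assumed value, not on the slide
    "body": 0.01,
    "shape": 0.01,
}

good = zero_order_score("clustering algorithm", topic)
bad = zero_order_score("body shape", topic)
print(good, bad)
```

With these numbers the label made of high-probability words ("clustering algorithm") scores well above "body shape", matching the ordering on the slide.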

8 Relevance (Task 2): the First-Order Score
Intuition: prefer phrases with similar context (distribution)
Topic θ, with p(w|θ): clustering, dimension, partition, algorithm, hash, …
Good label (l1): "clustering algorithm"
– p(w | clustering algorithm): clustering, hash, dimension, algorithm, partition, …
l2: "hash join"
– p(w | hash join): clustering, hash, dimension, key, algorithm, …
– Context in the collection: key … hash join … code … hash table … search … hash join … map key … hash … algorithm … key … hash … key table … join …
Score(l, θ) = −D(θ || l), so a label whose context distribution diverges less from the topic ranks higher
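
A minimal sketch of the first-order idea: compare the topic's word distribution with the word distribution observed around each candidate label, using KL divergence, and prefer the label with the smaller divergence. All probabilities below are invented for illustration, and the function names and the eps smoothing constant are assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """D(p || q) over the support of p; eps guards against q(w) = 0."""
    return sum(
        pw * math.log(pw / max(q.get(w, 0.0), eps))
        for w, pw in p.items()
        if pw > 0
    )

def first_order_score(topic, label_context):
    """Higher is better: the negated divergence from topic to label context."""
    return -kl_divergence(topic, label_context)

# Hypothetical distributions (the slides list the words but not the values).
topic = {"clustering": 0.4, "dimension": 0.2, "partition": 0.15,
         "algorithm": 0.15, "hash": 0.1}
ctx_clustering_algorithm = {"clustering": 0.35, "algorithm": 0.25,
                            "dimension": 0.2, "partition": 0.15, "hash": 0.05}
ctx_hash_join = {"hash": 0.45, "key": 0.3, "algorithm": 0.1,
                 "clustering": 0.05, "dimension": 0.05, "partition": 0.05}

s1 = first_order_score(topic, ctx_clustering_algorithm)
s2 = first_order_score(topic, ctx_hash_join)
print(s1, s2)
```

In this toy setting "clustering algorithm" has the smaller divergence from the topic and therefore the higher score.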

9 Discrimination and Coverage (Tasks 3 & 4)
Discriminative across topics:
– High relevance to the target topic, low relevance to other topics
High coverage inside topic:
– Use the MMR (maximal marginal relevance) strategy
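
Both re-ranking ideas can be sketched in a few lines. This is an illustration under stated assumptions: the penalty weight mu, the trade-off lam, the word-overlap (Jaccard) similarity between labels, and the per-topic relevance values in the first example are hypothetical choices, not values or formulas taken from the slides. The relevance scores in the MMR example are the Step 2 scores shown on slide 6.

```python
def discriminative_score(relevance_by_topic, target, mu=0.7):
    """Relevance to the target topic minus mu times mean relevance to the others."""
    others = [s for t, s in relevance_by_topic.items() if t != target]
    penalty = sum(others) / len(others) if others else 0.0
    return relevance_by_topic[target] - mu * penalty

def jaccard(a, b):
    """Word-overlap similarity between two labels (an assumed stand-in)."""
    x, y = set(a.split()), set(b.split())
    return len(x & y) / len(x | y)

def mmr_select(scores, k=2, lam=0.5):
    """Greedy MMR: trade relevance against redundancy with chosen labels."""
    selected, remaining = [], dict(scores)
    while remaining and len(selected) < k:
        def mmr(label):
            redundancy = max((jaccard(label, s) for s in selected), default=0.0)
            return lam * remaining[label] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        del remaining[best]
    return selected

# Discrimination: a generic label is relevant to every topic (hypothetical values).
generic = discriminative_score({"t1": 0.26, "t2": 0.20, "t3": 0.22}, "t1")
specific = discriminative_score({"t1": 0.19, "t2": 0.02, "t3": 0.04}, "t1")

# Coverage: relevance scores from slide 6.
relevance = {"information retrieval": 0.26, "retrieval models": 0.19,
             "IR models": 0.17, "pseudo feedback": 0.06}
chosen = mmr_select(relevance, k=2)
print(generic, specific, chosen)
```

The second MMR pick skips "retrieval models" because it shares a word with the already selected "information retrieval", so the selected labels cover more of the topic.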

10 Variations and Applications
Labeling document clusters
– Document cluster → unigram language model
– Applicable to any task with a unigram language model
Context-sensitive labels
– The label of a topic is sensitive to the context
– An alternative way to approach contextual text mining
– tree, prune, root, branch → "tree algorithms" in CS; → ? in horticulture; → ? in marketing?
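
The "document cluster → unigram language model" step can be sketched as a maximum-likelihood estimate over the pooled words of the cluster. The function name and the two toy documents are assumptions; the resulting distribution can then be labeled like any other multinomial topic.

```python
from collections import Counter

def cluster_language_model(docs):
    """Maximum-likelihood unigram model over all words in a document cluster."""
    counts = Counter(w for doc in docs for w in doc.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Hypothetical two-document cluster.
cluster = ["prune the tree", "tree root and branch"]
lm = cluster_language_model(cluster)
print(sorted(lm.items(), key=lambda kv: -kv[1])[:3])
```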

11 Experiments
Datasets:
– SIGMOD abstracts; SIGIR abstracts; AP news data
– Candidate labels: significant bigrams; NLP chunks
Topic models:
– PLSA, LDA
Evaluation:
– Human annotators compare labels generated by anonymous systems
– Order of systems randomly perturbed; scores averaged over all sample topics

12 Result Summary
Automatic phrase labels >> top words
First-order relevance >> zero-order relevance
Bigrams > NLP chunks
– Bigrams work better with literature; NLP chunks work better with news
System labels << human labels
– Scientific literature is an easier task

13 Results: Sample Topic Labels
Topic: tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01
– Labels: r tree; b tree; …; indexing methods
Topic: north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh, reagan, charges
– Labels: iran contra; …
Topic: the, of, a, and, to, data (each > 0.02), …
– Labels: large data; data quality; high data; data application; …
Topic: clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality
– Labels: clustering algorithm; clustering structure; …

14 Results: Context-Sensitive Labeling
Topic: sampling, estimation, approximation, histogram, selectivity, histograms, …
Context: Database (SIGMOD Proceedings)
– selectivity estimation; random sampling; approximate answers
Context: IR (SIGIR Proceedings)
– distributed retrieval; parameter estimation; mixture models
Explore the different meanings of a topic in different contexts (content switch)
An alternative approach to contextual text mining

15 Summary
Labeling: a postprocessing step for all multinomial topic models
A probabilistic approach to generating good labels
– Understandable, relevant, high coverage, discriminative
Broadly applicable to mining tasks involving multinomial word distributions; context-sensitive
Future work:
– Labeling hierarchical topic models
– Incorporating priors

16 Thanks! - Please come to our poster tonight (#40)

17 Multinomial Topic Models
Blei et al.: ~lemur/science/topics.html
Text Collection → Topic Model → Topics
– term, relevance, weight, feedback, independence, model, frequent, probabilistic, document, …
– (word, prob): data, university, new, results, end, high, research, figure, analysis, number, institute, …
Topic = unigram language model = multinomial distribution of terms
(Multinomial mixture, PLSA, LDA, and lots of extensions)
Applications: topic extraction, IR, contextual text mining, opinion analysis, …

18 Multinomial Topic Models
Statistical topic models
– Multinomial mixture, PLSA, LDA, and many extensions
Applications
– Topic extraction
– Information retrieval
– Contextual text mining
– Opinion extraction
A common problem:
– Hard to interpret (label topics)
Example topics:
– pollen 0.46, foraging 0.04, foragers 0.04, collected 0.03, grains 0.03, loads 0.03, collection 0.02, nectar 0.02, … → pollen
– glucose, mice, diabetes, hormone, body, weight, fat, … → ?

19 Overview
Semantically close (relevance)
Understandable – phrases?
High coverage inside topic
Discriminative across topics
…
Topic (Mei and Zhai 06: a topic in SIGIR): term, relevance, weight, feedback, independence, model, frequent, probabilistic, document, …
Candidate labels: iPod Nano; Pseudo-feedback; Information Retrieval; Retrieval models; じょうほうけんさく (Japanese for "information retrieval")

20 Our Method
Collection (Context) → statistical topic models → multinomial topic models
– term, relevance, weight, feedback, independence, model, frequent, probabilistic, document, …
Collection (Context) → NLP chunker, n-gram statistics → candidate label pool
– database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure, …
Step 1: Relevance score
Step 2: Re-ranking (coverage; discrimination)
Output: ranked list of labels
– clustering algorithm; distance measure; …

21 Relevance: the First-Order Score
Intuition: prefer phrases with similar context (distribution)
Context: SIGMOD Proceedings
Topic θ, with p(w|θ): clustering, dimension, partition, algorithm, hash, …
Good label (l1): "clustering algorithm"
– p(w | clustering algorithm): clustering, hash, dimension, algorithm, partition, …
Bad label (l2): "hash join"
– p(w | hash join): clustering, hash, dimension, key, algorithm, …
Score(l, θ): D(θ || clustering algorithm) < D(θ || hash join)

22 Our Method
Guarantee understandability with a preprocessing step
– Use phrases as candidate topic labels
– NLP chunks / statistically significant n-grams
A ranking problem: satisfy relevance, coverage, and discriminability with a probabilistic framework
Good labels = Understandable + Relevant + High Coverage + Discriminative

23 Results: Context-Sensitive Labeling
Topic: sampling, estimation, approximation, histogram, selectivity, histograms, …
– Context: Database (SIGMOD Proceedings): selectivity estimation; random sampling; approximate answers
– Context: IR (SIGIR Proceedings): distributed retrieval; parameter estimation; mixture models
Topic: dependencies, functional, cube, multivalued, iceberg, buc, …
– Context: Database (SIGMOD Proceedings): multivalue dependency; functional dependency; iceberg cube
– Context: IR (SIGIR Proceedings): term dependency; independence assumption

24 Results: Sample Topic Labels
Topic: sampling 0.06, estimation 0.04, approximate 0.04, histograms 0.03, selectivity 0.03, histogram 0.02, answers 0.02, accurate 0.02
– Labels: selectivity estimation; …
Topic: tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01
– Labels: r tree; b tree; …; indexing methods
Topic: north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh, reagan, charges
– Labels: iran contra; …
Topic: the, of, a, and, to, data (each > 0.02), …
– Labels: large data; data quality; high data; data application; …
Topic: clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality
– Labels: clustering algorithm; clustering structure; …

25 Preliminary Results (SIGMOD)
Label: top words of the topic
– Sensor networks: constraint, sensor, assert, index, integrity, network, procedure
– View maintenance: view, update, warehouse, copy, array, directory, increment
– Query languages: language, relate, relational, model, extension, semantic, definition
– Recursive queries: recursion, algebra, b-tree, rule, general, relate, nest
– Concurrency contr.: transact, concurrent, control, protocol, lock, replicate, distribute
– Clustering algo.: cluster, spatial, join, algorithm, dimension, dataset, mine
– Query optimizers: optimize, query, plan, execution, join, statistic, estimate
– Graphic interface: graph, visual, multimedia, browse, graphic, transitive, interface 0.013

26 Preliminary Results (SIGMOD II)
Label: top words of the topic
– Client-server: file, serve, client, grid, message, policy, storage
– Knowledge base: dependency 0.06, schema, knowledge, function, rule, form, extract
– Data cube: cube, rank, db, aggregate, dimension, search, framework
– XML data: xml, document 0.07, query, xquery, temporal, twig, element
– Stream manage.: stream, parallel, process, continuous, partition, resource, physical
– Information src.: web, integrate, service, source, enterprise, business, wrap
– Declarative lang.: workflow, system, language, path, database, constraint, integrity
– Index structures: tree, index, node, r-tree, b, structure, main 0.015