Automatic Labeling of Multinomial Topic Models Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai University of Illinois at Urbana-Champaign
Outline
Background: statistical topic models
Labeling a topic model: criteria and challenge
Our approach: a probabilistic framework
Experiments
Summary
Statistical Topic Models for Text Mining
Text collections → probabilistic topic modeling → topic models (multinomial word distributions), e.g., a retrieval topic (term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …) or a web topic (web 0.21, search 0.10, link 0.08, graph 0.05, …)
Applications: subtopic discovery, opinion comparison, summarization, topical pattern analysis, …
Models: PLSA [Hofmann 99], LDA [Blei et al. 03], Author-Topic [Steyvers et al. 04], CPLSA [Mei & Zhai 06], Pachinko allocation [Li & McCallum 06], Topic over time [Wang et al. 06], …
Topic Models: Hard to Interpret
Option 1: show the top words (term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, …) — automatic, but hard to make sense of ("term, relevance, weight, feedback" = ?)
Option 2: human-generated labels (e.g., "Retrieval Models") — make sense, but cannot scale up (another example topic: insulin, foraging, foragers, collected, grains, loads, collection, nectar, …)
Question: can we automatically generate understandable labels for topics?
What is a Good Label?
Example topic (a topic learned from SIGIR abstracts, Mei & Zhai 06): term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …
Candidate labels: "iPod Nano", "じょうほうけんさく" (Japanese for "information retrieval"), "Pseudo-feedback", "Retrieval models", "Information Retrieval"
A good label should be: semantically close to the topic (relevant); understandable (e.g., a phrase); high coverage inside the topic; discriminative across topics
Our Method
Input: a collection (e.g., SIGIR) and a topic (term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …)
Step 1 — Candidate label pool: extract phrases from the collection with an NLP chunker and n-gram statistics, e.g., information retrieval, retrieval model, index structure, relevance feedback, …
Step 2 — Relevance score: information retrieval 0.26, retrieval models 0.19, IR models 0.17, pseudo feedback 0.06, …
Step 3 — Discrimination against other topics (e.g., filtering 0.21, collaborative 0.15, …; trec 0.18, evaluation 0.10, …): retrieval models 0.20, IR models 0.18, pseudo feedback 0.09, information retrieval 0.26 → 0.01, …
Step 4 — Coverage (remove redundancy): retrieval models 0.20, pseudo feedback 0.09, IR models 0.18 → 0.02, …
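A minimal sketch of how the n-gram side of Step 1 could be done, assuming the collection is already tokenized. The PMI-style scoring and the min_count/top_n thresholds are illustrative assumptions, not the exact statistics used here (the method also uses an NLP chunker):

```python
from collections import Counter
import math

def significant_bigrams(tokenized_docs, min_count=5, top_n=500):
    """Candidate label pool from n-gram statistics: keep frequent bigrams
    whose words co-occur more often than chance (PMI-style score).
    Thresholds and scoring are assumptions for this sketch."""
    uni, bi = Counter(), Counter()
    total = 0
    for doc in tokenized_docs:
        uni.update(doc)
        bi.update(zip(doc, doc[1:]))
        total += len(doc)
    scored = []
    for (w1, w2), count in bi.items():
        if count < min_count:
            continue
        # PMI ~ log p(w1,w2) / (p(w1) p(w2)), approximating the bigram total by `total`
        pmi = math.log(count * total / (uni[w1] * uni[w2]))
        scored.append((pmi, f"{w1} {w2}"))
    return [bigram for _, bigram in sorted(scored, reverse=True)[:top_n]]
```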
Relevance (Task 2): the Zero-Order Score
Intuition: prefer phrases that cover the top words of the topic well
Example (latent topic θ with p("clustering"|θ) = 0.4, p("dimensional"|θ) = 0.3, …, p("shape"|θ) = 0.01, p("body"|θ) = 0.001):
Good label (l1): "clustering algorithm" — its words have high probability p(w|θ)
Bad label (l2): "body shape" — its words have low probability p(w|θ)
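A toy sketch of the zero-order idea: score a candidate label by how likely its words are under p(w|θ). The sum-of-log-probabilities form and the smoothing floor are assumptions of this illustration; the probability for "algorithm" is hypothetical (the slide only gives the others):

```python
import math

def zero_order_score(label, topic, floor=1e-6):
    """Zero-order relevance: log-likelihood of the label's words under the
    topic's word distribution p(w|theta). Unseen words get a small floor
    probability (a smoothing assumption for this sketch)."""
    return sum(math.log(topic.get(w, floor)) for w in label.lower().split())

# Toy topic from the slide: "clustering algorithm" should outscore "body shape".
topic = {"clustering": 0.4, "dimensional": 0.3, "algorithm": 0.05,  # 0.05 is hypothetical
         "shape": 0.01, "body": 0.001}
print(zero_order_score("clustering algorithm", topic))  # higher (less negative)
print(zero_order_score("body shape", topic))            # lower
```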
Relevance (Task 2): the First-Order Score
Intuition: prefer phrases whose context distribution is similar to the topic distribution
Example: the topic P(w|θ) puts mass on clustering, dimension, partition, algorithm, hash, …; p(w | "clustering algorithm"), estimated from the contexts where the phrase occurs, is close to it, while p(w | "hash join") (key, hash join, code, hash table, search, map, …) is not
Good label (l1): "clustering algorithm"; worse label (l2): "hash join"
Score(l, θ) = −D(θ || l), the negative KL divergence between the topic and the label's context distribution
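A sketch of the first-order score as −D(θ‖l): compare the topic distribution with a context distribution p(w|l) estimated from the documents in which the label occurs. Estimating p(w|l) from whole documents containing the phrase, and the smoothing floor, are assumptions of this sketch:

```python
import math
from collections import Counter

def label_context_lm(label, tokenized_docs):
    """Estimate p(w|l) from the documents that contain the label phrase."""
    words = label.split()
    counts, total = Counter(), 0
    for doc in tokenized_docs:
        if any(doc[i:i + len(words)] == words for i in range(len(doc))):
            counts.update(doc)
            total += len(doc)
    return {w: c / total for w, c in counts.items()} if total else {}

def first_order_score(topic, label_lm, floor=1e-6):
    """First-order relevance: -D(theta || l) = sum_w p(w|theta) log p(w|l)/p(w|theta).
    A score closer to zero means the label's context matches the topic better."""
    return sum(p * math.log(label_lm.get(w, floor) / p) for w, p in topic.items())
```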
Discrimination and Coverage (Tasks 3 & 4)
Discrimination across topics: favor labels with high relevance to the target topic and low relevance to the other topics
Coverage inside the topic: select the final set of labels with a Maximal Marginal Relevance (MMR) strategy so they cover different aspects of the topic
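Two illustrative sketches, assuming relevance scores (e.g., the first-order scores above) are already computed: a discrimination adjustment that subtracts the average relevance to competing topics, and a greedy MMR-style selection for coverage. The weights mu and lam, and the dict-based interfaces, are hypothetical choices for this sketch:

```python
def discriminative_score(rel_target, rel_others, mu=0.7):
    """Penalize a label's relevance to the target topic by its average
    relevance to the other topics (mu is a hypothetical trade-off weight)."""
    avg_other = sum(rel_others) / len(rel_others) if rel_others else 0.0
    return rel_target - mu * avg_other

def mmr_select(candidates, relevance, similarity, k=3, lam=0.5):
    """Greedy MMR-style selection of k labels: balance relevance against
    redundancy with labels already chosen.
    relevance:  dict label -> (discriminative) relevance score
    similarity: dict (label_a, label_b) -> redundancy between two labels"""
    chosen, pool = [], list(candidates)
    while pool and len(chosen) < k:
        def mmr(l):
            redundancy = max((similarity.get((l, c), 0.0) for c in chosen), default=0.0)
            return lam * relevance[l] - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        chosen.append(best)
        pool.remove(best)
    return chosen
```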
Variations and Applications
Labeling document clusters: treat a document cluster as a unigram language model; the method then applies to any task that produces a unigram language model (sketch below)
Context-sensitive labels: the label of a topic depends on the context, e.g., tree, prune, root, branch → "tree algorithms" in CS? in horticulture? in marketing? — an alternative way to approach contextual text mining
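To see why the method carries over to document clusters, here is a minimal sketch (assuming simple maximum-likelihood estimation) that turns a cluster of tokenized documents into the unigram language model the labeling pipeline expects:

```python
from collections import Counter

def cluster_unigram_lm(tokenized_docs):
    """Maximum-likelihood unigram language model for a document cluster;
    the resulting p(w) dict can be labeled exactly like a topic's p(w|theta)."""
    counts = Counter(w for doc in tokenized_docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}
```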
Experiments
Datasets: SIGMOD abstracts, SIGIR abstracts, AP news data
Candidate labels: significant bigrams; NLP chunks
Topic models: PLSA, LDA
Evaluation: human annotators compare labels generated by anonymized systems; the order of systems is randomly perturbed, and scores are averaged over all sample topics
Result Summary
Automatic phrase labels >> top words
First-order relevance >> zero-order relevance
Bigram candidates > NLP chunks overall: bigrams work better on the literature collections, NLP chunks better on news
System labels << human labels; scientific literature is the easier task
Results: Sample Topic Labels
AP news topic (north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007, …) → label: iran contra
SIGMOD topic (clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005, …; the very top of the list is dominated by the, of, a, and, to, data, each > 0.02) → labels: clustering algorithm, clustering structure (other candidates shown: large data, data quality, high data, data application, …)
SIGMOD topic (tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01, …) → labels: r tree, b tree, indexing methods
Results: Context-Sensitive Labeling
Topic: sampling, estimation, approximation, histogram, selectivity, histograms, …
Context: Database (SIGMOD proceedings) → selectivity estimation; random sampling; approximate answers
Context: IR (SIGIR proceedings) → distributed retrieval; parameter estimation; mixture models
The labels expose the different meanings of the same topic in different contexts (content switch) — an alternative approach to contextual text mining
Summary
Labeling is a post-processing step applicable to any multinomial topic model
A probabilistic approach to generating good labels: understandable, relevant, high-coverage, discriminative
Broadly applicable to mining tasks involving multinomial word distributions; supports context-sensitive labeling
Future work: labeling hierarchical topic models; incorporating priors
Thanks!