Automatic Labeling of Multinomial Topic Models Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai University of Illinois at Urbana-Champaign
2 Outline
- Background: statistical topic models
- Labeling a topic model: criteria and challenges
- Our approach: a probabilistic framework
- Experiments
- Summary
3 Statistical Topic Models for Text Mining
Probabilistic topic modeling turns text collections into topic models (multinomial word distributions), e.g.:
- web 0.21, search 0.10, link 0.08, graph 0.05, ...
- term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, ...
Applications: subtopic discovery, opinion comparison, summarization, topical pattern analysis
Models: PLSA [Hofmann 99], LDA [Blei et al. 03], Author-Topic [Steyvers et al. 04], Pachinko allocation [Li & McCallum 06], Topic over time [Wang et al. 06], CPLSA [Mei & Zhai 06], ...
4 Topic Models: Hard to Interpret
- Top words: automatic, but hard to make sense of, e.g.: term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, ...
- Human-generated labels (e.g., "Retrieval Models"): make sense, but cannot scale up
- Another example: insulin, foraging, foragers, collected, grains, loads, collection, nectar, ... — what is this topic about?
Question: can we automatically generate understandable labels for topics?
5 What Is a Good Label?
Example topic (Mei and Zhai 06, a topic in SIGIR): term, relevance, weight, feedback, independence, model, frequent, probabilistic, document, ...
Candidate labels: "Retrieval models", "Information Retrieval", "Pseudo-feedback", "iPod Nano", "じょうほうけんさく" (Japanese for "information retrieval" — not understandable to an English reader)
A good label is:
- Semantically close to the topic (relevance)
- Understandable — e.g., a phrase
- High coverage inside the topic
- Discriminative across topics
6 Our Method
Input: a collection (e.g., SIGIR) and a topic to label, e.g.: term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, ... (alongside other topics: filtering 0.21, collaborative 0.15, ...; trec 0.18, evaluation 0.10, ...)
Step 1: build a candidate label pool with an NLP chunker / N-gram statistics: information retrieval, retrieval model, index structure, relevance feedback, ...
Step 2: rank candidates by a relevance score: information retrieval 0.26; retrieval models 0.19; IR models 0.17; pseudo feedback 0.06; ...
Step 3: re-rank for discrimination across topics: retrieval models 0.20; IR models 0.18; pseudo feedback 0.09; ...; information retrieval 0.01
Step 4: re-rank for coverage inside the topic: retrieval models 0.20; pseudo feedback 0.09; ...
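The N-gram-statistics half of Step 1 can be sketched as follows. This is only an illustrative stand-in: the slides do not say which significance test is used, so PMI with a count threshold is an assumption.

```python
import math
from collections import Counter

def significant_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI),
    keeping only pairs seen at least min_count times. PMI is an
    illustrative stand-in for whatever N-gram significance test the
    actual system uses."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = [
        ((w1, w2), math.log((c / n) / ((unigrams[w1] / n) * (unigrams[w2] / n))))
        for (w1, w2), c in bigrams.items()
        if c >= min_count
    ]
    return sorted(scored, key=lambda x: -x[1])
```

Phrases that pass the threshold form the candidate label pool that the later relevance, discrimination, and coverage steps re-rank.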
7 Relevance (Task 2): the Zero-Order Score
Intuition: prefer phrases that cover the topic's top words well.
Latent topic p(w|θ): p("clustering"|θ) = 0.4, p("dimensional"|θ) = 0.3, ..., p("body"|θ) = p("shape"|θ) = 0.01
Good label (l1): "clustering algorithm" vs. bad label (l2): "body shape"
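A minimal sketch of the zero-order idea, assuming a label is scored by the log-likelihood of its words under p(w|θ); the smoothing floor for unseen words is an assumption, not part of the slides:

```python
import math

def zero_order_score(label_words, topic, eps=1e-6):
    """Log-likelihood of the label's words under the topic's
    multinomial p(w|theta); unseen words get a small floor eps."""
    return sum(math.log(topic.get(w, eps)) for w in label_words)

# Probabilities taken from the slide's latent-topic example
topic = {"clustering": 0.4, "dimensional": 0.3, "body": 0.01, "shape": 0.01}
```

With these numbers, a label built from top words ("clustering", "dimensional") outscores one built from tail words ("body", "shape"), matching the good-vs-bad comparison on the slide.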
8 Relevance (Task 2): the First-Order Score
Intuition: prefer phrases with a context distribution similar to the topic's.
Topic p(w|θ): clustering, dimension, partition, algorithm, hash, ...
Good label (l1 = "clustering algorithm"), p(w | clustering algorithm): clustering, hash, dimension, algorithm, partition, ...
Bad label (l2 = "hash join"), p(w | hash join): clustering, hash, dimension, key, algorithm, ... (estimated from contexts such as "...hash join... code ...hash table ...search ...hash join... map key ...hash ...algorithm ...key ...hash ...key table ...join...")
Score(l, θ) = −D(θ || θ_l): the smaller the divergence between the topic and the label's context distribution, the better the label.
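A minimal sketch of the first-order score, assuming the label's context distribution p(w|l) has already been estimated from the collection; the eps smoothing for out-of-support words is an assumption:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """D(p || q) over the union vocabulary of two word distributions,
    with a tiny eps so that words missing from q stay finite."""
    vocab = set(p) | set(q)
    return sum(
        p[w] * math.log((p[w] + eps) / (q.get(w, 0.0) + eps))
        for w in vocab
        if p.get(w, 0.0) > 0
    )

def first_order_score(topic, label_context):
    """First-order relevance: negative KL divergence between the topic
    p(w|theta) and the candidate label's context distribution p(w|l)."""
    return -kl_divergence(topic, label_context)
```

A label whose context distribution tracks the topic ("clustering algorithm") scores higher than one concentrated on unrelated words ("hash join"), which is exactly the comparison the slide makes.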
9 Discrimination and Coverage (Tasks 3 & 4)
- Discriminative across topics: high relevance to the target topic, low relevance to other topics
- High coverage inside the topic: use the MMR (Maximal Marginal Relevance) strategy
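The coverage step can be sketched with a generic MMR loop: greedily pick the label that is relevant but dissimilar to labels already chosen. The word-overlap similarity and the λ trade-off below are illustrative assumptions, not the paper's exact formulation.

```python
def jaccard(a, b):
    """Word-overlap (Jaccard) similarity between two phrase labels."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def mmr_select(candidates, relevance, similarity, k=3, lam=0.5):
    """Maximal Marginal Relevance: at each step take the candidate
    maximizing lam * relevance - (1 - lam) * max similarity to the
    labels already selected, spreading labels over the whole topic."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda l: lam * relevance[l]
            - (1 - lam) * max((similarity(l, s) for s in selected), default=0.0),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```

With the slide-6 relevance scores, MMR promotes "pseudo feedback" ahead of the redundant "retrieval models" once "information retrieval" is chosen, which is the coverage behavior the slide describes.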
10 Variations and Applications
- Labeling document clusters: a document cluster is a unigram language model, so the method applies to any task involving a unigram language model
- Context-sensitive labels: the label of a topic depends on the context, e.g., {tree, prune, root, branch} → "tree algorithms" in CS? In horticulture? In marketing?
- An alternative way to approach contextual text mining
11 Experiments
- Datasets: SIGMOD abstracts; SIGIR abstracts; AP news data
- Candidate labels: significant bigrams; NLP chunks
- Topic models: PLSA, LDA
- Evaluation: human annotators compare labels generated by anonymized systems; the order of systems is randomly perturbed; scores are averaged over all sample topics
12 Result Summary
- Automatic phrase labels >> top words
- First-order relevance >> zero-order relevance
- Bigrams > NLP chunks: bigrams work better on scientific literature, NLP chunks better on news
- System labels << human labels, though scientific literature is the easier task
13 Results: Sample Topic Labels
- Topic: clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality ... → labels: clustering algorithm, clustering structure, ... (weaker candidates: large data, data quality, high data, data application, ...)
- Topic: tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01 → labels: r tree, b tree, ..., indexing methods
- Topic: north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh, reagan, charges, ... → label: iran contra
- Background topic (the, of, a, and, to, data, ..., each > 0.02)
14 Results: Context-Sensitive Labeling
Topic: sampling, estimation, approximation, histogram, selectivity, histograms, ...
- Context: Database (SIGMOD proceedings) → selectivity estimation; random sampling; approximate answers
- Context: IR (SIGIR proceedings) → distributed retrieval; parameter estimation; mixture models
Explores the different meanings a topic takes in different contexts (content switch) — an alternative approach to contextual text mining.
15 Summary
- Labeling is a post-processing step for all multinomial topic models
- A probabilistic approach generates good labels: understandable, relevant, high-coverage, discriminative
- Broadly applicable to mining tasks involving multinomial word distributions; can be made context-sensitive
- Future work: labeling hierarchical topic models; incorporating priors
16 Thanks! - Please come to our poster tonight (#40)
17 Multinomial Topic Models
Example topic browser: Blei et al., ~lemur/science/topics.html
A topic model maps a text collection to a set of topics; each topic is a unigram language model, i.e., a multinomial distribution over terms, e.g.:
- term, relevance, weight, feedback, independence, model, frequent, probabilistic, document, ...
- data, university, new, results, end, high, research, figure, analysis, number, institute, ...
(Multinomial mixture, PLSA, LDA, and lots of extensions.)
Applications: topic extraction, IR, contextual text mining, opinion analysis, ...
18 Multinomial Topic Models
- Statistical topic models: multinomial mixture, PLSA, LDA, and many extensions
- Applications: topic extraction; information retrieval; contextual text mining; opinion extraction
- A common problem: topics are hard to interpret (to label), e.g.:
  pollen 0.46, foraging 0.04, foragers 0.04, collected 0.03, grains 0.03, loads 0.03, collection 0.02, nectar 0.02, ...
  glucose, mice, diabetes, hormone, body, weight, fat, ... — what are these topics about?
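To make "a topic is a multinomial distribution over words" concrete, drawing words from one shows what p(w|θ) means; the probabilities below come from the slide's pollen example, and the sampler itself is only an illustration:

```python
import random

def sample_words(topic, n, seed=0):
    """Draw n words from a topic, i.e. from a multinomial (unigram)
    distribution p(w|theta) over the vocabulary."""
    rng = random.Random(seed)
    words, probs = zip(*sorted(topic.items()))
    return rng.choices(words, weights=probs, k=n)

# Probabilities from the slide's hard-to-interpret example topic
pollen_topic = {"pollen": 0.46, "foraging": 0.04, "foragers": 0.04,
                "collected": 0.03, "grains": 0.03, "loads": 0.03,
                "collection": 0.02, "nectar": 0.02}
```

High-probability words like "pollen" dominate the draws, which is why top-word lists are the default (if hard to read) summary of a topic.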
20 Our Method
Input: a multinomial topic model (e.g., a topic term, relevance, weight, feedback, independence, model, frequent, probabilistic, document, ...) plus the collection (context)
→ NLP chunker / N-gram statistics → candidate label pool: database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure, ...
→ Step 1: relevance score → Step 2: re-ranking for coverage and discrimination
→ Ranked list of labels: clustering algorithm; distance measure; ...
21 Relevance: the First-Order Score
Intuition: prefer phrases with a similar context (distribution) to the topic's.
Topic p(w|θ): clustering, dimension, partition, algorithm, hash, ...
Good label (l1 = "clustering algorithm"), p(w | clustering algorithm): clustering, hash, dimension, algorithm, partition, ...
Bad label (l2 = "hash join"), p(w | hash join): clustering, hash, dimension, key, algorithm, ...
Context distributions are estimated from the SIGMOD proceedings; here D(θ || clustering algorithm) < D(θ || hash join), so "clustering algorithm" gets the higher Score(l, θ).
22 Our Method
Good labels = understandable + relevant + high coverage + discriminative
- Guarantee understandability with a pre-processing step: use phrases — NLP chunks or statistically significant N-grams — as candidate topic labels
- Then treat labeling as a ranking problem: satisfy relevance, coverage, and discrimination within a probabilistic framework
23 Results: Context-Sensitive Labeling
Topic 1: sampling, estimation, approximation, histogram, selectivity, histograms, ...
- Context: Database (SIGMOD proceedings) → selectivity estimation; random sampling; approximate answers
- Context: IR (SIGIR proceedings) → distributed retrieval; parameter estimation; mixture models
Topic 2: dependencies, functional, cube, multivalued, iceberg, buc, ...
- Context: Database → multivalued dependency; functional dependency; iceberg cube
- Context: IR → term dependency; independence assumption
24 Results: Sample Topic Labels
- Topic: sampling 0.06, estimation 0.04, approximate 0.04, histograms 0.03, selectivity 0.03, histogram 0.02, answers 0.02, accurate 0.02 → label: selectivity estimation
- Topic: clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality ... → labels: clustering algorithm, clustering structure, ... (weaker candidates: large data, data quality, high data, data application, ...)
- Topic: tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01 → labels: r tree, b tree, ..., indexing methods
- Topic: north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh, reagan, charges, ... → label: iran contra
- Background topic (the, of, a, and, to, data, ..., each > 0.02)
25 Preliminary Results (SIGMOD)
- Sensor networks: constraint, sensor, assert, index, integrity, network, procedure
- View maintenance: view, update, warehouse, copy, array, directory, increment
- Query languages: language, relate, relational, model, extension, semantic, definition
- Recursive queries: recursion, algebra, b-tree, rule, general, relate, nest
- Concurrency control: transact, concurrent, control, protocol, lock, replicate, distribute
- Clustering algorithms: cluster, spatial, join, algorithm, dimension, dataset, mine
- Query optimizers: optimize, query, plan, execution, join, statistic, estimate
- Graphic interface: graph, visual, multimedia, browse, graphic, transitive, interface 0.013
26 Preliminary Results (SIGMOD II)
- Client-server: file, serve, client, grid, message, policy, storage
- Knowledge base: dependency 0.06, schema, knowledge, function, rule, form, extract
- Data cube: cube, rank, db, aggregate, dimension, search, framework
- XML data: xml, document 0.07, query, xquery, temporal, twig, element
- Stream management: stream, parallel, process, continuous, partition, resource, physical
- Information sources: web, integrate, service, source, enterprise, business, wrap
- Declarative languages: workflow, system, language, path, database, constraint, integrity
- Index structures: tree, index, node, r-tree, b, structure, main 0.015