Automatic Labeling of Multinomial Topic Models

Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai
DAIS: The Database and Information Systems Laboratory at The University of Illinois at Urbana-Champaign
Large Scale Information Management

Multinomial Topic Models: Hard to Interpret (Label)

Text Collection -> Topic Model -> Topics

Topic = unigram language model = multinomial distribution of terms
(Multinomial mixture, PLSA, LDA, and lots of extensions)

Example topics (Blei et al., ~lemur/science/topics.html), shown as word / prob tables:
- term, relevance, weight, feedback, independence, model, frequent, probabilistic, document, …
- data, university, new, results, end, high, research, figure, analysis, number, institute, …

Applications: topic extraction, IR, contextual text mining, opinion analysis, …

Question: Can we automatically generate meaningful labels for topics?

Our Method

Components: statistical topic models; NLP chunker; n-gram statistics
Multinomial topic models (example): term, relevance, weight, feedback, independence, model, frequent, probabilistic, document, …
Candidate label pool, drawn from the collection (context): database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure, …
1. Relevance score -> ranked list of labels: clustering algorithm; distance measure; …
2. Re-ranking: coverage; discrimination

Good labels = Understandable + Relevant + High Coverage + Discriminative

Results: Sample Topic Labels

Topic: sampling 0.06, estimation 0.04, approximate 0.04, histograms 0.03, selectivity 0.03, histogram 0.02, answers 0.02, accurate 0.02
Label: selectivity estimation

Topic: the, of, a, and, to, data > 0.02, …, clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality
Labels: clustering algorithms; large data

Topic: tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01
Labels: r tree; b tree; indexing methods

Topic: north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh, reagan, charges
Labels: iran contra; iran contra trial

Topic: plane 0.02, air 0.02, flight 0.02, pilot 0.01, crew 0.01, force 0.01, accident 0.01, crash
Labels: air plane crashes; air force

Context-sensitive Labeling

Topic: sampling, estimation, approximation, histogram, selectivity, histograms, …
  Context: Database (SIGMOD Proceedings): selectivity estimation; random sampling; approximate answers
  Context: IR (SIGIR Proceedings): distributed retrieval; parameter estimation; mixture models

Topic: dependencies, functional, cube, multivalued, iceberg, buc, …
  Context: Database (SIGMOD Proceedings): multivalue dependency; functional dependency; iceberg cube
  Context: IR (SIGIR Proceedings): term dependency; independence assumption

- Applicable to any task with a unigram language model, such as labeling document clusters

Modeling the Relevance

Zero-order relevance: prefer phrases well covering top words
  Latent topic θ, p(w|θ): clustering, dimensional, algorithm, birch, shape, …, body, …
  Good label (l_1): “clustering algorithm”
  Bad label (l_2): “body shape”

First-order relevance: prefer phrases with similar context (distribution)
  Context: SIGMOD Proceedings
  Topic θ with P(w|θ); label distributions P(w|l_1) and P(w|l_2)
  Words shown in the panels: clustering, dimension, partition, algorithm, hash, join, …
  Good label (l_1): “clustering algorithm”
  Bad label (l_2): “hash join”
  D(θ || l_1) < D(θ || l_2)
  - Pointwise mutual information based on C, pre-computed
  - Bias of using C to estimate l

Scoring and Re-ranking Labels

- High coverage inside a topic (MMR)
- Discriminative across topics

Example topic: insulin, foraging, foragers, collected, grains, loads, collection, nectar, …
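
The first-order relevance criterion above (prefer the label whose context distribution is closest to the topic, i.e. the label with the smaller divergence D) can be illustrated with a small sketch. This is not the authors' implementation: the poster mentions pre-computed pointwise mutual information based on C, whereas this sketch estimates P(w|l) directly from documents that contain the label and compares labels by KL divergence. The function names, the additive smoothing constant `alpha`, and the toy collection are all assumptions made for illustration.

```python
import math
from collections import Counter


def label_context_model(label, docs, vocab, alpha=0.01):
    """Estimate P(w | l) over `vocab` from documents containing every word of the label.

    Hypothetical helper: additive smoothing (alpha) keeps all probabilities non-zero.
    """
    label_words = label.split()
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        if all(w in tokens for w in label_words):
            counts.update(t for t in tokens if t in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(theta, label_model):
    """D(theta || l) = sum_w P(w|theta) * log(P(w|theta) / P(w|l))."""
    return sum(p * math.log(p / label_model[w]) for w, p in theta.items() if p > 0)


def rank_labels(theta, candidates, docs):
    """Rank candidate labels by increasing divergence from the topic (best label first)."""
    vocab = set(theta)
    scored = [
        (kl_divergence(theta, label_context_model(label, docs, vocab)), label)
        for label in candidates
    ]
    return [label for _, label in sorted(scored)]


# Toy collection and topic (made-up data, standing in for the context C and topic theta).
toy_docs = [
    "clustering algorithm partition dimension clustering",
    "clustering algorithm dimension clustering partition",
    "hash join algorithm hash join",
    "hash join hash algorithm",
]
toy_theta = {
    "clustering": 0.40,
    "dimension": 0.20,
    "partition": 0.20,
    "algorithm": 0.15,
    "hash": 0.05,
}

ranking = rank_labels(toy_theta, ["hash join", "clustering algorithm"], toy_docs)
print(ranking)
```

On this toy collection the documents containing “clustering algorithm” share the topic's high-probability words, so that label has the smaller divergence and is ranked ahead of “hash join”. A real system would need a proper tokenizer, phrase matching, and a much larger context collection.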