Published by Douglas Willis Henry. Modified over 9 years ago.
Automatic Labeling of Multinomial Topic Models
Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai
DAIS: The Database and Information Systems Laboratory at The University of Illinois at Urbana-Champaign
Large Scale Information Management

Multinomial Topic Models: Hard to Interpret (Label)

Text Collection -> Topic Model -> Topics

Example topics (word, probability):
- term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …
- data 0.0358, university 0.0132, new 0.0119, results 0.0119, end 0.0116, high 0.0098, research 0.0096, figure 0.0089, analysis 0.0076, number 0.0073, institute 0.0072, …

Blei et al. http://www.cs.cmu.edu/~lemur/science/topics.html

Topic = unigram language model = multinomial distribution of terms (multinomial mixture, PLSA, LDA, & lots of extensions)

Applications: topic extraction, IR, contextual text mining, opinion analysis, …

Question: Can we automatically generate meaningful labels for topics?

Our Method

Good labels = Understandable + Relevant + High Coverage + Discriminative

Inputs:
- Multinomial topic models, produced by statistical topic models
- Collection (Context)

Candidate label pool, built from the collection with an NLP chunker and n-gram statistics: database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure, …

Step 1, Relevance Score: rank the candidate labels against each topic. Ranked list of labels: clustering algorithm; distance measure; …

Step 2, Re-ranking: Coverage; Discrimination

Modeling the Relevance

Zero-order relevance: prefer phrases well covering top words.
- Latent topic $p(w \mid \theta)$: clustering, dimensional, algorithm, birch, shape, …, body, …
- Good label ($l_1$): "clustering algorithm"
- Bad label ($l_2$): "body shape"

First-order relevance: prefer phrases with similar context (distribution).
- Context: SIGMOD Proceedings
- Topic $P(w \mid \theta)$: clustering, dimension, partition, algorithm, hash, …
- Good label ($l_1$): "clustering algorithm", with $P(w \mid l_1)$ over clustering, hash, dimension, algorithm, partition, …
- Bad label ($l_2$): "hash join", with $P(w \mid l_2)$ over clustering, hash, dimension, join, algorithm, …
- $D(\theta \,\|\, l_1) < D(\theta \,\|\, l_2)$

Notes on the first-order score:
- Bias of using C to estimate l
- Pointwise mutual information based on C, pre-computed

Scoring and Re-ranking Labels
- High coverage inside topic (MMR)
- Discriminative across topics

Results: Sample Topic Labels

Topic (word, probability) | Labels
sampling 0.06, estimation 0.04, approximate 0.04, histograms 0.03, selectivity 0.03, histogram 0.02, answers 0.02, accurate 0.02 | selectivity estimation
the, of, a, and, to, data, … > 0.02; clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005 | clustering algorithms; large data
tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01 | r tree; b tree; indexing methods
north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007 | iran contra; iran contra trial
plane 0.02, air 0.02, flight 0.02, pilot 0.01, crew 0.01, force 0.01, accident 0.01, crash 0.005 | air plane crashes; air force

Context-sensitive Labeling

Topic | Context: Database (SIGMOD Proceedings) | Context: IR (SIGIR Proceedings)
sampling, estimation, approximation, histogram, selectivity, histograms, … | selectivity estimation; random sampling; approximate answers | distributed retrieval; parameter estimation; mixture models
dependencies, functional, cube, multivalued, iceberg, buc, … | multivalue dependency; functional dependency; iceberg cube | term dependency; independence assumption

- Applicable to any task with a unigram language model, such as labeling document clusters

Further example topic: insulin, foraging, foragers, collected, grains, loads, collection, nectar, …
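The two relevance notions on the poster can be illustrated with a small sketch. It is not the authors' implementation: the poster's scoring formulas did not survive in this transcript, so the code assumes that zero-order relevance sums the log-probabilities of a label's words under the topic, and that first-order relevance is a topic-weighted sum of document-level pointwise mutual information between each topic word and the label, computed from a context collection C. The function names, the toy topic, and the toy documents are all hypothetical.

```python
import math

# Toy topic p(w | theta) and toy context collection C (both invented for illustration).
TOPIC = {"clustering": 0.4, "algorithm": 0.3, "hash": 0.1,
         "dimension": 0.1, "partition": 0.1}
DOCS = [
    "clustering algorithm for large dimension data",
    "a clustering algorithm with partition step",
    "hash join and hash partition",
    "hash join cost model",
]


def zero_order_score(label, topic, floor=1e-6):
    """Assumed zero-order relevance: sum of log p(w | theta) over the label's words.

    Words missing from the topic get a small floor probability.
    """
    return sum(math.log(topic.get(w, floor)) for w in label.split())


def first_order_score(label, topic, docs):
    """Assumed first-order relevance: sum_w p(w | theta) * PMI(w, label | C).

    PMI is estimated from document-level co-occurrence in `docs`.
    Pairs that never co-occur contribute 0 (a simplification, not minus infinity).
    """
    n = len(docs)
    has_label = [f" {label} " in f" {d} " for d in docs]
    df_label = sum(has_label)
    if df_label == 0:
        return 0.0
    score = 0.0
    for word, prob in topic.items():
        has_word = [word in d.split() for d in docs]
        df_word = sum(has_word)
        joint = sum(a and b for a, b in zip(has_word, has_label))
        if joint == 0:
            continue
        score += prob * math.log(n * joint / (df_word * df_label))
    return score


for candidate in ("clustering algorithm", "hash join"):
    print(candidate,
          round(zero_order_score(candidate, TOPIC), 3),
          round(first_order_score(candidate, TOPIC, DOCS), 3))
```

On this toy data both scores rank "clustering algorithm" above "hash join", matching the good-label and bad-label example on the poster. Precomputing the PMI table once per collection, as the poster notes, would avoid rescanning the documents for every label.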
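The re-ranking step names MMR (maximal marginal relevance) for coverage inside a topic. A minimal sketch of greedy MMR selection follows, under assumptions the poster does not confirm: label similarity is taken to be Jaccard overlap of words, the trade-off weight `lam` is arbitrary, and the relevance scores and the label "efficient clustering algorithm" are invented for illustration.

```python
def jaccard(a, b):
    """Word-overlap similarity between two labels (an assumed similarity measure)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)


def mmr_rerank(scores, k, lam=0.7, sim=jaccard):
    """Greedily pick k labels, trading relevance against redundancy.

    scores: dict mapping label -> relevance score for one topic.
    lam:    weight on relevance; (1 - lam) weights the redundancy penalty.
    """
    selected = []
    remaining = dict(scores)
    while remaining and len(selected) < k:
        def mmr(label):
            # Redundancy is the highest similarity to any label already chosen.
            redundancy = max((sim(label, s) for s in selected), default=0.0)
            return lam * remaining[label] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical relevance scores for one topic.
scores = {
    "clustering algorithm": 0.9,
    "efficient clustering algorithm": 0.85,
    "distance measure": 0.6,
}
print(mmr_rerank(scores, k=2))
```

With these numbers the second pick is "distance measure" rather than the near-duplicate "efficient clustering algorithm", which is the coverage behaviour the re-ranking step is after. The across-topic discrimination criterion is not covered by this sketch.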