Automatic Labeling of Multinomial Topic Models Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai 1 4 6 3 2 5 1 DAIS The Database and Information Systems Laboratory.

Slides:

Advertisements

Similar presentations

CWS: A Comparative Web Search System Jian-Tao Sun, Xuanhui Wang, § Dou Shen Hua-Jun Zeng, Zheng Chen Microsoft Research Asia University of Illinois at.

Advertisements

ACM SIGIR 2009 Workshop on Redundancy, Diversity, and Interdependent Document Relevance, July 23, 2009, Boston, MA 1 Modeling Diversity in Information.

1 Language Models for TR (Lecture for CS410-CXZ Text Info Systems) Feb. 25, 2011 ChengXiang Zhai Department of Computer Science University of Illinois,

EventCube Aviation Safety Data Analysis System Fangbo Tao, Xiao Yu, Jiawei Han 08/10/13.

Statistical Topic Modeling part 1

MICHAEL PAUL AND ROXANA GIRJU UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN A Two-Dimensional Topic-Aspect Model for Discovering Multi-Faceted Topics.

IR Challenges and Language Modeling. IR Achievements Search engines  Meta-search  Cross-lingual search  Factoid question answering  Filtering Statistical.

A Scalable Semantic Indexing Framework for Peer-to-Peer Information Retrieval University of Illinois at Urbana-Champain Zhichen XuYan Chen Northwestern.

Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.

Latent Dirichlet Allocation a generative model for text

Language Models for TR Rong Jin Department of Computer Science and Engineering Michigan State University.

Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.

Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.

Phrase Mining and Topic Modeling for Structure Discovery from Text

Scalable Text Mining with Sparse Generative Models

Context Analysis in Text Mining and Search Qiaozhu Mei Department of Computer Science University of Illinois at Urbana-Champaign

Topic Modeling with Network Regularization Qiaozhu Mei, Deng Cai, Duo Zhang, ChengXiang Zhai University of Illinois at Urbana-Champaign.

Generating Impact-Based Summaries for Scientific Literature Qiaozhu Mei, ChengXiang Zhai University of Illinois at Urbana-Champaign 1.

Topic Models in Text Processing IR Group Meeting Presented by Qiaozhu Mei.

Topic Extraction from Biology Literature: Prior, Labeling, and Switching Qiaozhu Mei.

Automatic Web Tagging and Person Tagging Using Language Models - Qiaozhu Mei †, Yi Zhang ‡ Presented by Jessica Gronski ‡ † University of Illinois at Urbana-Champaign.

Transfer Learning Task. Problem Identification Dataset : A Year: 2000 Features: 48 Training Model ‘M’ Testing 98.6% Training Model ‘M’ Testing 97% Dataset.

Latent Semantic Analysis Hongning Wang Recap: vector space model Represent both doc and query by concept vectors – Each concept defines one dimension.

Comparative Text Mining Q. Mei, C. Liu, H. Su, A. Velivelli, B. Yu, C. Zhai DAIS The Database and Information Systems Laboratory. at The University of.

A General Optimization Framework for Smoothing Language Models on Graph Structures Qiaozhu Mei, Duo Zhang, ChengXiang Zhai University of Illinois at Urbana-Champaign.

1 Rated Aspect Summarization of Short Comments Yue Lu, ChengXiang Zhai, and Neel Sundaresan.

Context-Sensitive Information Retrieval Using Implicit Feedback Xuehua Shen : department of Computer Science University of Illinois at Urbana-Champaign.

1 Rated Aspect Summarization of Short Comments Yue Lu, ChengXiang Zhai, and Neel Sundaresan Presented by: Sapan Shah.

Toward A Session-Based Search Engine Smitha Sriram, Xuehua Shen, ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.

1 CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 7. Topic Extraction.

Text mining. The Standard Data Mining process Text Mining Machine learning on text data Text Data mining Text analysis Part of Web mining Typical tasks.

Relevance Feedback Hongning Wang What we have learned so far Information Retrieval User results Query Rep Doc Rep (Index) Ranker.

Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.

LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.

Automatic Labeling of Multinomial Topic Models Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai University of Illinois at Urbana-Champaign.

A Model for Learning the Semantics of Pictures V. Lavrenko, R. Manmatha, J. Jeon Center for Intelligent Information Retrieval Computer Science Department,

Chapter 23: Probabilistic Language Models April 13, 2004.

Latent Dirichlet Allocation

1 A Biterm Topic Model for Short Texts Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng Institute of Computing Technology, Chinese Academy of Sciences.

Web Search and Text Mining Lecture 5. Outline Review of VSM More on LSI through SVD Term relatedness Probabilistic LSI.

Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.

Language Modeling Putting a curve to the bag of words Courtesy of Chris Jordan.

Supporting Knowledge Discovery: Next Generation of Search Engines Qiaozhu Mei 04/21/2005.

2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 龙星计划课程 : 信息检索 Course Summary ChengXiang Zhai ( 翟成祥 ) Department of.

Active Feedback in Ad Hoc IR Xuehua Shen, ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.

Link Distribution on Wikipedia [0407]KwangHee Park.

Discovering Evolutionary Theme Patterns from Text - An Exploration of Temporal Text Mining Qiaozhu Mei and ChengXiang Zhai Department of Computer Science.

Automatic Labeling of Multinomial Topic Models

Relevance Models and Answer Granularity for Question Answering W. Bruce Croft and James Allan CIIR University of Massachusetts, Amherst.

DISTRIBUTED INFORMATION RETRIEVAL Lee Won Hee.

Relevance Feedback Hongning Wang

Text Information Management ChengXiang Zhai, Tao Tao, Xuehua Shen, Hui Fang, Azadeh Shakery, Jing Jiang.

Discovering Evolutionary Theme Patterns from Text -An exploration of Temporal Text Mining KDD’05, August 21–24, 2005, Chicago, Illinois, USA. Qiaozhu Mei.

Modeling Annotated Data (SIGIR 2003) David M. Blei, Michael I. Jordan Univ. of California, Berkeley Presented by ChengXiang Zhai, July 10, 2003.

Text-classification using Latent Dirichlet Allocation - intro graphical model Lei Li

A Study of Poisson Query Generation Model for Information Retrieval

哈工大信息检索研究室 HITIR ’ s Update Summary at TAC2008 Extractive Content Selection Using Evolutionary Manifold-ranking and Spectral Clustering Reporter: Ph.d.

Understanding unstructured texts via Latent Dirichlet Allocation Raphael Cohen DSaaS, EMC IT June 2015.

Hierarchical Clustering & Topic Models

Context Analysis in Text Mining and Search

Overview of Statistical Language Models

Probabilistic Topic Model

Course Summary (Lecture for CS410 Intro Text Info Systems)

Relevance Feedback Hongning Wang

The topic discovery models

Bayesian Inference for Mixture Language Models

Junghoo “John” Cho UCLA

Topic Models in Text Processing

Language Models for TR Rong Jin

Presentation transcript:

Automatic Labeling of Multinomial Topic Models Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai DAIS The Database and Information Systems Laboratory at The University of Illinois at Urbana-Champaign Large Scale Information Management Multinomial Topic Models: Hard to Interpret (Label) Blei et al. ~lemur/science/topics.htmlhttp:// ~lemur/science/topics.html term relevance weight feedback independence model frequent probabilistic document … Text Collection … Topics word prob data university new results end high research figure analysis number institute … Topic Model term relevance weight feedback independence model frequent probabilistic document … term relevance weight feedback independence model frequent probabilistic document … Topic Unigram language model Multinomial distribution of Terms = = (Multinomial mixture, PLSA, LDA, & lots of extensions) Applications: topic extraction, IR, contextual text mining, opinion analysis… ? Question: Can we automatically generate meaningful labels for topics? Our Method Statistical topic models NLP Chunker Ngram stat. term relevance weight feedback independence model frequent probabilistic document … term relevance weight feedback independence model frequent probabilistic document … term relevance weight feedback independence model frequent probabilistic document … Multinomial topic models database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure … Candidate label pool Collection (Context) Ranked List of Labels clustering algorithm; distance measure; … Relevance Score Re-ranking Coverage; Discrimination 1 2 Good labels = Understandable + Relevant + High Coverage + Discriminative sampling 0.06 estimation 0.04 approximate 0.04 histograms 0.03 selectivity 0.03 histogram 0.02 answers 0.02 accurate 0.02 tree 0.09 trees 0.08 spatial 0.08 b 0.05 r 0.04 disk 0.02 array 0.01 cache 0.01 north 0.02 case 0.01 trial 0.01 iran 0.01 documents 0.01 walsh reagan charges the, of, a, and, to, data, > 0.02 … clustering 0.02 time 0.01 clusters 0.01 databases 0.01 large 0.01 performance 0.01 quality Context-sensitive Labeling sampling estimation approximation histogram selectivity histograms … selectivity estimation; random sampling; approximate answers; multivalue dependency functional dependency Iceberg cube distributed retrieval; parameter estimation; mixture models; term dependency; independence assumption; Context: Database (SIGMOD Proceedings) Context: IR (SIGIR Proceedings) dependencies functional cube multivalued iceberg buc … - Applicable to any task with unigram language model, such as labeling document clusters Results: Sample Topic Labels selectivity estimation Clustering algorithms large data r tree b tree Indexing methods iran contra plane 0.02 air 0.02 flight 0.02 pilot 0.01 crew 0.01 force 0.01 accident 0.01 crash air plane crashes air force iran contra trial Clustering dimension partition algorithm hash Clustering hash dimension algorithm partition Context: SIGMOD Proceedings Topic  … … P(w|  ) P(w| l 1 ) D(  | l 1 ) < D(  | l 2 ) Good Label ( l 1 ): “clustering algorithm” Clustering hash dimension join algorithm … Bad Label ( l 2 ): “hash join” P(w| l 2 ) Clustering dimensional algorithm birch shape Latent Topic  … Good Label ( l 1 ): “clustering algorithm” body Bad Label ( l 2 ): “body shape” … p(w|  ) Modeling the Relevance Zero-order relevance: prefer phrases well covering top words First-order relevance: prefer phrases with similar context (distribution) Bias of using C to estimate l Pointwise mutual information based on C, pre-computed High Coverage inside topic (MMR): Discriminative across topic: Scoring and Re-ranking Labels insulin foraging foragers collected grains loads collection nectar …