A Cross-Collection Mixture Model for Comparative Text Mining
ChengXiang Zhai (Department of Computer Science), Atulya Velivelli (Department of Electrical and Computer Engineering), Bei Yu (Graduate School of Library and Information Science)
University of Illinois, Urbana-Champaign, U.S.A.
Motivation
- Many applications involve a comparative analysis of several comparable text collections, e.g.:
  - Given news articles from different sources (about the same event), can we extract what is common to all the sources and what is unique to one specific source?
  - Given customer reviews of 3 different brands of laptops, can we extract the common themes (e.g., battery life, speed, warranty) and compare the three brands on each common theme?
  - Given web sites of companies selling similar products, can we analyze the strengths/weaknesses of each company?
- Existing work in text mining has conceptually focused on a single collection of text and is thus inadequate for comparative text analysis
- We aim to develop methods for comparing multiple collections of text and performing comparative text mining
Comparative Text Mining (CTM)
- Problem definition:
  - Given a comparable set of text collections
  - Discover & analyze their common and unique properties
- [Diagram: a pool of text collections C1, C2, ..., Ck, mined into common themes plus C1-specific, C2-specific, ..., Ck-specific themes]
Example: Summarizing Customer Reviews
Ideal results from comparative text mining (IBM, APPLE, and DELL laptop reviews):

| Common Themes | "IBM" specific    | "APPLE" specific   | "DELL" specific   |
|---------------|-------------------|--------------------|-------------------|
| Battery Life  | Long, 4-3 hrs     | Medium, 3-2 hrs    | Short, 2-1 hrs    |
| Hard disk     | Large, 80-100 GB  | Small, 5-10 GB     | Medium, 20-50 GB  |
| Speed         | Slow, 100-200 MHz | Very fast, 3-4 GHz | Moderate, 1-2 GHz |
A More Realistic Setup of CTM
In practice, the output for each theme is a common word distribution plus collection-specific word distributions (IBM, APPLE, and DELL laptop reviews; a data-structure sketch follows below):

| Common Word Distr.                          | "IBM" specific                         | "APPLE" specific                           | "DELL" specific                      |
|---------------------------------------------|----------------------------------------|--------------------------------------------|--------------------------------------|
| Battery 0.129, Hours 0.080, Life 0.060, ... | Long 0.120, 4hours 0.010, 3hours 0.008 | Reasonable 0.10, Medium 0.08, 2hours 0.002 | Short 0.05, Poor 0.01, 1hours 0.005, ... |
| Disk 0.015, IDE 0.010, Drive 0.005          | Large 0.100, 80GB 0.050                | Small 0.050, 5GB 0.030, ...                | Medium 0.123, 20GB 0.080, ...        |
| Pentium 0.113, Processor 0.050              | Slow 0.114, 200Mhz 0.080               | Fast 0.151, 3Ghz 0.100                     | Moderate 0.116, 1Ghz 0.070           |
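For concreteness, the target output can be pictured as nested word-to-probability maps. A minimal Python sketch, with the numbers copied from the table above (the structure itself is an illustration, not a representation prescribed by the paper):

```python
# Illustrative CTM output: each theme pairs one common word distribution
# with one collection-specific word distribution per collection.
themes = [
    {
        "common": {"battery": 0.129, "hours": 0.080, "life": 0.060},
        "specific": {
            "IBM":   {"long": 0.120, "4hours": 0.010, "3hours": 0.008},
            "APPLE": {"reasonable": 0.10, "medium": 0.08, "2hours": 0.002},
            "DELL":  {"short": 0.05, "poor": 0.01, "1hours": 0.005},
        },
    },
    # ... one entry per theme (disk, speed, ...)
]
```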
A Basic Approach: Simple Clustering
- Pool all documents together and perform clustering (a baseline sketch follows below)
- Hopefully, some clusters reflect common themes and others collection-specific themes
- However, we can't "force" a common theme to cover all collections
- [Diagram: a background model θB plus flat themes θ1, θ2, θ3, θ4, ...]
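For concreteness, here is a minimal sketch of such a pooled baseline; it is our own illustration, not the paper's implementation (the paper compares against a simple mixture model, whereas this uses k-means over TF-IDF vectors, and the function name `pooled_clustering` is an assumption):

```python
# Baseline sketch: pool all documents and cluster, ignoring collection labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def pooled_clustering(docs_by_collection, k=4):
    """docs_by_collection: dict mapping collection name -> list of raw texts."""
    docs = [d for docs in docs_by_collection.values() for d in docs]
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return labels  # nothing forces a cluster to cover every collection
```

Because the collection labels play no role in the objective, a "common" theme may end up drawing its members from only one collection, which is exactly the weakness the slide points out.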
Improved Clustering: Cross-Collection Mixture Models
- Explicitly distinguish and model common themes and collection-specific themes
- Fit a mixture model to the text data
- Estimate parameters using EM
- Clusters are more meaningful
- [Diagram: a background model θB; for each theme j = 1, ..., k, a common model θj plus collection-specific models θj,1, θj,2, ..., θj,m for collections C1, ..., Cm]
Details of the Mixture Model
- Accounts for noise via a background model θB (common non-informative words)
- "Generating" word w in document d of collection Ci:
  p(w | d, Ci) = λB p(w | θB) + (1 − λB) Σ_{j=1..k} π_{d,j} [ λC p(w | θj) + (1 − λC) p(w | θ_{j,i}) ]
  where θj is the common distribution of theme j and θ_{j,i} is its Ci-specific distribution
- Parameters:
  - λB = noise level (manually set)
  - λC = common-specific tradeoff (manually set)
  - The θ's and π's are estimated with maximum likelihood (an EM sketch follows below)
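The formula above maps directly onto an EM procedure. The following is a minimal numpy sketch under that formula; it is our own illustration, not the authors' implementation, and the initialization, smoothing constants, and function/argument names (`cross_collection_em`, `counts`, `coll_of_doc`) are assumptions:

```python
import numpy as np

def cross_collection_em(counts, coll_of_doc, m, k, lam_B=0.9, lam_C=0.3,
                        n_iter=50, seed=0):
    """EM sketch for the cross-collection mixture model.

    counts:       (n_docs, V) array of word counts
    coll_of_doc:  length-n_docs array of collection indices in [0, m)
    m, k:         number of collections / themes
    lam_B, lam_C: noise level and common-specific tradeoff (set manually)
    """
    rng = np.random.default_rng(seed)
    n, V = counts.shape
    theta_B = counts.sum(0) / counts.sum()             # background: pooled freq.
    theta = rng.dirichlet(np.ones(V), size=k)          # common theme distrs
    theta_ci = rng.dirichlet(np.ones(V), size=(k, m))  # collection-specific
    pi = np.full((n, k), 1.0 / k)                      # doc-level theme weights

    for _ in range(n_iter):
        new_theta = np.zeros((k, V))
        new_theta_ci = np.zeros((k, m, V))
        new_pi = np.zeros((n, k))
        for d in range(n):
            i = coll_of_doc[d]
            # per-theme word probs: lam_C * common + (1 - lam_C) * specific
            mix = lam_C * theta + (1 - lam_C) * theta_ci[:, i, :]     # (k, V)
            themes_part = pi[d][:, None] * mix                        # (k, V)
            denom = lam_B * theta_B + (1 - lam_B) * themes_part.sum(0) + 1e-12
            # E-step: posterior that word w came from theme j (not background)
            p_j = (1 - lam_B) * themes_part / denom                   # (k, V)
            # within theme j: posterior that w came from the common distr
            p_common = lam_C * theta / (mix + 1e-12)                  # (k, V)
            w_j = counts[d] * p_j            # expected counts per theme
            new_pi[d] = w_j.sum(1)
            new_theta += w_j * p_common
            new_theta_ci[:, i, :] += w_j * (1 - p_common)
        # M-step: renormalize expected counts into distributions
        pi = new_pi / (new_pi.sum(1, keepdims=True) + 1e-12)
        theta = new_theta / (new_theta.sum(1, keepdims=True) + 1e-12)
        theta_ci = new_theta_ci / (new_theta_ci.sum(2, keepdims=True) + 1e-12)
    return theta_B, theta, theta_ci, pi
```

Each E-step computes, per word, the posterior over "background vs. theme j" and, within a theme, "common vs. collection-specific"; the M-step renormalizes the resulting expected counts. As the slide states, λB and λC stay fixed throughout.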
Experiments
- Two data sets:
  - War news (2 collections)
    - Iraq war: a combination of 30 articles from CNN and BBC websites
    - Afghan war: a combination of 26 articles from CNN and BBC websites
  - Laptop customer reviews (3 collections)
    - Apple iBook Mac: 34 reviews downloaded from epinions.com
    - Dell Inspiron: 22 reviews downloaded from epinions.com
    - IBM Thinkpad: 42 reviews downloaded from epinions.com
- On each data set, we compare a simple mixture model with the cross-collection mixture model
Comparison of Simple and Cross-Collection Clustering
- [Side-by-side clustering results shown on the slide]
- Results from cross-collection clustering are more meaningful
Cross-Collection Clustering Results (Laptop Reviews)
- Top words serve as "labels" for the common themes (e.g., [sound, speakers], [battery, hours], [cd, drive]); see the sketch below
- These word distributions can be used to segment text and add hyperlinks between documents
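Reading such labels off an estimated model is straightforward. A small sketch, assuming the `theta`/`theta_ci` arrays returned by the EM sketch earlier and a vocabulary list `vocab` aligned with their columns:

```python
import numpy as np

def top_words(theta_row, vocab, n=5):
    """Return the n highest-probability words of one theme distribution."""
    idx = np.argsort(theta_row)[::-1][:n]
    return [(vocab[t], float(theta_row[t])) for t in idx]

# e.g. label for common theme j:   top_words(theta[j], vocab)
# and its DELL-specific variant:   top_words(theta_ci[j, dell_idx], vocab)
```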
Summary and Future Work
- We defined a new text mining problem, referred to as comparative text mining (CTM), which has many applications
- We proposed and evaluated a cross-collection mixture model for CTM
- Experimental results show that the proposed cross-collection model is more effective for CTM than a simple mixture model
- Future work:
  - Further improve the mixture model and estimation method (e.g., consider proximity, MAP estimation)
  - Use the model to segment documents and create hyperlinks between segments (e.g., feed the learned word distributions into HMMs for segmentation)