A Cross-Collection Mixture Model for Comparative Text Mining
ChengXiang Zhai (Department of Computer Science), Atulya Velivelli (Department of Electrical and Computer Engineering), Bei Yu (Graduate School of Library and Information Science)
University of Illinois, Urbana-Champaign, U.S.A.
Motivation
- Many applications involve a comparative analysis of several comparable text collections, e.g.:
  - Given news articles from different sources (about the same event), can we extract what is common to all the sources and what is unique to one specific source?
  - Given customer reviews of 3 different brands of laptops, can we extract the common themes (e.g., battery life, speed, warranty) and compare the three brands on each common theme?
  - Given web sites of companies selling similar products, can we analyze the strengths/weaknesses of each company?
- Existing work in text mining has conceptually focused on a single collection of text and is thus inadequate for comparative text analysis
- We aim to develop methods for comparing multiple collections of text and performing comparative text mining
Comparative Text Mining (CTM)
- Problem definition:
  - Given a comparable set of text collections
  - Discover & analyze their common and unique properties
- [Diagram: a pool of text collections C1, C2, ..., Ck, mined into common themes plus C1-specific, C2-specific, ..., Ck-specific themes]
Example: Summarizing Customer Reviews
Ideal results from comparative text mining (IBM, APPLE, and DELL laptop reviews):

| Common Themes | "IBM" specific    | "APPLE" specific   | "DELL" specific   |
|---------------|-------------------|--------------------|-------------------|
| Battery Life  | Long, 4-3 hrs     | Medium, 3-2 hrs    | Short, 2-1 hrs    |
| Hard disk     | Large, 80-100 GB  | Small, 5-10 GB     | Medium, 20-50 GB  |
| Speed         | Slow, 100-200 MHz | Very fast, 3-4 GHz | Moderate, 1-2 GHz |
A More Realistic Setup of CTM
In practice, the output for each theme is a common word distribution plus collection-specific word distributions (IBM, APPLE, and DELL laptop reviews; a data-structure sketch follows below):

| Common Word Distr.                          | "IBM" specific                         | "APPLE" specific                           | "DELL" specific                      |
|---------------------------------------------|----------------------------------------|--------------------------------------------|--------------------------------------|
| Battery 0.129, Hours 0.080, Life 0.060, ... | Long 0.120, 4hours 0.010, 3hours 0.008 | Reasonable 0.10, Medium 0.08, 2hours 0.002 | Short 0.05, Poor 0.01, 1hours 0.005, ... |
| Disk 0.015, IDE 0.010, Drive 0.005          | Large 0.100, 80GB 0.050                | Small 0.050, 5GB 0.030, ...                | Medium 0.123, 20GB 0.080, ...        |
| Pentium 0.113, Processor 0.050              | Slow 0.114, 200Mhz 0.080               | Fast 0.151, 3Ghz 0.100                     | Moderate 0.116, 1Ghz 0.070           |
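For concreteness, the target output can be pictured as nested word-to-probability maps. A minimal Python sketch, with the numbers copied from the table above (the structure itself is an illustration, not a representation prescribed by the paper):

```python
# Illustrative CTM output: each theme pairs one common word distribution
# with one collection-specific word distribution per collection.
themes = [
    {
        "common": {"battery": 0.129, "hours": 0.080, "life": 0.060},
        "specific": {
            "IBM":   {"long": 0.120, "4hours": 0.010, "3hours": 0.008},
            "APPLE": {"reasonable": 0.10, "medium": 0.08, "2hours": 0.002},
            "DELL":  {"short": 0.05, "poor": 0.01, "1hours": 0.005},
        },
    },
    # ... one entry per theme (disk, speed, ...)
]
```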
A Basic Approach: Simple Clustering
- Pool all documents together and perform clustering (a baseline sketch follows below)
- Hopefully, some clusters reflect common themes and others collection-specific themes
- However, we can't "force" a common theme to cover all collections
- [Diagram: a background model θB plus flat themes θ1, θ2, θ3, θ4, ...]
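For concreteness, here is a minimal sketch of such a pooled baseline; it is our own illustration, not the paper's implementation (the paper compares against a simple mixture model, whereas this uses k-means over TF-IDF vectors, and the function name `pooled_clustering` is an assumption):

```python
# Baseline sketch: pool all documents and cluster, ignoring collection labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def pooled_clustering(docs_by_collection, k=4):
    """docs_by_collection: dict mapping collection name -> list of raw texts."""
    docs = [d for docs in docs_by_collection.values() for d in docs]
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return labels  # nothing forces a cluster to cover every collection
```

Because the collection labels play no role in the objective, a "common" theme may end up drawing its members from only one collection, which is exactly the weakness the slide points out.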
Improved Clustering: Cross-Collection Mixture Models
- Explicitly distinguish and model common themes and collection-specific themes
- Fit a mixture model to the text data
- Estimate parameters using EM
- Clusters are more meaningful
- [Diagram: a background model θB; for each theme j = 1, ..., k, a common model θj plus collection-specific models θj,1, θj,2, ..., θj,m for collections C1, ..., Cm]
Details of the Mixture Model
- Accounts for noise via a background model θB (common non-informative words)
- "Generating" word w in document d of collection Ci:
  p(w | d, Ci) = λB p(w | θB) + (1 − λB) Σ_{j=1..k} π_{d,j} [ λC p(w | θj) + (1 − λC) p(w | θ_{j,i}) ]
  where θj is the common distribution of theme j and θ_{j,i} is its Ci-specific distribution
- Parameters:
  - λB = noise level (manually set)
  - λC = common-specific tradeoff (manually set)
  - The θ's and π's are estimated with maximum likelihood (an EM sketch follows below)
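The formula above maps directly onto an EM procedure. The following is a minimal numpy sketch under that formula; it is our own illustration, not the authors' implementation, and the initialization, smoothing constants, and function/argument names (`cross_collection_em`, `counts`, `coll_of_doc`) are assumptions:

```python
import numpy as np

def cross_collection_em(counts, coll_of_doc, m, k, lam_B=0.9, lam_C=0.3,
                        n_iter=50, seed=0):
    """EM sketch for the cross-collection mixture model.

    counts:       (n_docs, V) array of word counts
    coll_of_doc:  length-n_docs array of collection indices in [0, m)
    m, k:         number of collections / themes
    lam_B, lam_C: noise level and common-specific tradeoff (set manually)
    """
    rng = np.random.default_rng(seed)
    n, V = counts.shape
    theta_B = counts.sum(0) / counts.sum()             # background: pooled freq.
    theta = rng.dirichlet(np.ones(V), size=k)          # common theme distrs
    theta_ci = rng.dirichlet(np.ones(V), size=(k, m))  # collection-specific
    pi = np.full((n, k), 1.0 / k)                      # doc-level theme weights

    for _ in range(n_iter):
        new_theta = np.zeros((k, V))
        new_theta_ci = np.zeros((k, m, V))
        new_pi = np.zeros((n, k))
        for d in range(n):
            i = coll_of_doc[d]
            # per-theme word probs: lam_C * common + (1 - lam_C) * specific
            mix = lam_C * theta + (1 - lam_C) * theta_ci[:, i, :]     # (k, V)
            themes_part = pi[d][:, None] * mix                        # (k, V)
            denom = lam_B * theta_B + (1 - lam_B) * themes_part.sum(0) + 1e-12
            # E-step: posterior that word w came from theme j (not background)
            p_j = (1 - lam_B) * themes_part / denom                   # (k, V)
            # within theme j: posterior that w came from the common distr
            p_common = lam_C * theta / (mix + 1e-12)                  # (k, V)
            w_j = counts[d] * p_j            # expected counts per theme
            new_pi[d] = w_j.sum(1)
            new_theta += w_j * p_common
            new_theta_ci[:, i, :] += w_j * (1 - p_common)
        # M-step: renormalize expected counts into distributions
        pi = new_pi / (new_pi.sum(1, keepdims=True) + 1e-12)
        theta = new_theta / (new_theta.sum(1, keepdims=True) + 1e-12)
        theta_ci = new_theta_ci / (new_theta_ci.sum(2, keepdims=True) + 1e-12)
    return theta_B, theta, theta_ci, pi
```

Each E-step computes, per word, the posterior over "background vs. theme j" and, within a theme, "common vs. collection-specific"; the M-step renormalizes the resulting expected counts. As the slide states, λB and λC stay fixed throughout.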
Experiments
- Two data sets:
  - War news (2 collections)
    - Iraq war: a combination of 30 articles from CNN and BBC websites
    - Afghan war: a combination of 26 articles from CNN and BBC websites
  - Laptop customer reviews (3 collections)
    - Apple iBook Mac: 34 reviews downloaded from epinions.com
    - Dell Inspiron: 22 reviews downloaded from epinions.com
    - IBM Thinkpad: 42 reviews downloaded from epinions.com
- On each data set, we compare a simple mixture model with the cross-collection mixture model
Comparison of Simple and Cross-Collection Clustering
- [Side-by-side clustering results shown on the slide]
- Results from cross-collection clustering are more meaningful
Cross-Collection Clustering Results (Laptop Reviews)
- Top words serve as "labels" for the common themes (e.g., [sound, speakers], [battery, hours], [cd, drive]); see the sketch below
- These word distributions can be used to segment text and add hyperlinks between documents
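Reading such labels off an estimated model is straightforward. A small sketch, assuming the `theta`/`theta_ci` arrays returned by the EM sketch earlier and a vocabulary list `vocab` aligned with their columns:

```python
import numpy as np

def top_words(theta_row, vocab, n=5):
    """Return the n highest-probability words of one theme distribution."""
    idx = np.argsort(theta_row)[::-1][:n]
    return [(vocab[t], float(theta_row[t])) for t in idx]

# e.g. label for common theme j:   top_words(theta[j], vocab)
# and its DELL-specific variant:   top_words(theta_ci[j, dell_idx], vocab)
```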
Summary and Future Work
- We defined a new text mining problem, referred to as comparative text mining (CTM), which has many applications
- We proposed and evaluated a cross-collection mixture model for CTM
- Experimental results show that the proposed cross-collection model is more effective for CTM than a simple mixture model
- Future work:
  - Further improve the mixture model and estimation method (e.g., consider proximity, MAP estimation)
  - Use the model to segment documents and create hyperlinks between segments (e.g., feed the learned word distributions into HMMs for segmentation)