Document Clustering Matt Hughes
Document Clustering: What is it? The categorization, or clustering, of documents based on term frequency and other relevance measures. Breaks down huge linear result lists into manageable sets.
You use document clusters all the time: Table of contents. Yahoo (human categorized; its search is not true document clustering). “More like this.” Suggest a term.
The Joke
DC is human-centered searching. Poor search skills. Make the Web accessible for all people. Represents the way we think; mirrors our brain (not hierarchical, but overlapping grouping). The answer to information overload.
DC: Discover new patterns. Visual representations allow the user to see entire results on one page. See patterns between sets. Customer service (IBM). Gene mapping. Stock market. Domain independent; works with any topic. You can also supply your own data vocabulary.
DC: Decrease time-to-result. A Google search for “Penn State” returns 1,400,000 results. Can only display a maximum of 100 results on one page; 14,000 pages to see all results. Document clustering: filters duplicates; categorizes, so you find what you need in two or three pages.
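The page count on this slide is a ceiling division of the result count by the page size. A minimal sketch using the slide's own figures; the function name `pages_needed` is illustrative, not taken from any search engine API:

```python
import math


def pages_needed(total_results: int, per_page: int) -> int:
    """Number of result pages a user must visit to see every result."""
    return math.ceil(total_results / per_page)


# Figures from the slide: 1,400,000 results at 100 results per page.
print(pages_needed(1_400_000, 100))
```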
The Cluster Hypothesis “Closely associated documents tend to be relevant to the same requests.” van Rijsbergen 1979 “… I would claim that document clustering can lead to more effective retrieval than linear search [which] ignores the relationships that exist between documents.” van Rijsbergen 1979 Marti Hearst UCB SIMS, Fall 98
“Berry-Picking” as an Information Seeking Strategy (Bates 90). Standard IR model: The information need remains the same throughout the search session. The goal is to produce a perfect set of relevant docs. Berry-picking model (the Web): The query is continually shifting. Users may move through a variety of sources. New information may yield new ideas and new directions. The value of search is in the bits and pieces picked up along the way. Marti Hearst UCB SIMS, Fall 98
A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 90) [Figure: a search path through queries Q0 to Q5] Marti Hearst UCB SIMS, Fall 98
Problems with Document Clustering: Variability in the quality of results (can be improved by providing a vocabulary). Not good at differentiating homogeneous collections. Currently slower than linear, “PageRank”-like technologies.
What has been done so far? Visual DC maps (2D mapping, 3D mapping). Traditional DC search engines. Customer service. Automated content creation (http://news.google.com).
2D DC Mapping: Webbrain.com
2D DC Mapping: Smartmoney.com
3D Clustering (image from Wise et al 95)
Traditional DC Search Engines http://www.vivisimo.com http://www.infonetware.com
Customer Service Mapping (IBM)
How does Document Clustering work?
Text Clustering: Finds overall similarities among groups of documents. Finds overall similarities among groups of tokens. Picks out some themes, ignores others; i.e., filters out duplicate and redundant documents. Marti Hearst UCB SIMS, Fall ‘98
Document Clustering: Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw [Figure: plot with axes Term 1 and Term 2] Marti Hearst UCB SIMS, Fall ‘98
Document/Document Matrix Marti Hearst UCB SIMS, Fall ‘98
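A document/document matrix holds one similarity score for every pair of documents. The sketch below is illustrative only: it assumes raw term counts as the document representation and cosine similarity as the measure, neither of which is specified on the slide, and the sample documents are made up.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for one document (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def doc_doc_matrix(docs):
    """Square matrix whose entry [i][j] is the similarity of docs i and j."""
    vectors = [term_vector(d) for d in docs]
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Hypothetical three-document collection.
docs = ["penn state football", "penn state university", "gene mapping data"]
matrix = doc_doc_matrix(docs)
for row in matrix:
    print(" ".join(f"{x:.2f}" for x in row))
```

The first two documents share two of three terms and score about 0.67; the third shares no terms with either and scores 0.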
Agglomerative Clustering [Figure: items A to I grouped bottom-up] Marti Hearst UCB SIMS, Fall ‘98
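Agglomerative clustering starts with every item in its own group and repeatedly merges the two closest groups. A minimal sketch under stated assumptions: the coordinates given to items A to I are hypothetical, and single-link (closest pair) distance is one common choice, not necessarily the one used in the original figure.

```python
import math
from itertools import combinations

# Hypothetical 2-D positions for the nine items shown on the slide.
POINTS = {
    "A": (0, 0), "B": (0, 1), "C": (1, 0),
    "D": (10, 10), "E": (10, 11), "F": (11, 10),
    "G": (20, 0), "H": (20, 1), "I": (21, 0),
}


def single_link(c1, c2, points):
    """Distance between two clusters = distance of their closest members."""
    return min(math.dist(points[a], points[b]) for a in c1 for b in c2)


def agglomerate(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[name] for name in points]
    while len(clusters) > k:
        # combinations() yields index pairs with i < j.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: single_link(clusters[p[0]], clusters[p[1]], points),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return clusters


clusters = agglomerate(POINTS, 3)
print(clusters)
```

Stopping at `k` clusters is a simplification; recording each merge instead would give the full hierarchy.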
K-Means Clustering: 1. Create a pair-wise similarity measure. 2. Find K centers using agglomerative clustering: take a small sample and group bottom-up until K groups are found. 3. Assign each document to the nearest center, forming new clusters. 4. Repeat step 3 as necessary. Marti Hearst UCB SIMS, Fall ‘98
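The four steps above can be sketched as follows. This is an illustration under assumptions, not the slide author's implementation: documents are stood in for by 2-D points, Euclidean distance plays the role of the pair-wise measure, groups are compared by their centroids, and all names and data are hypothetical.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def seed_centers(sample, k):
    """Step 2: group a small sample bottom-up until k groups remain."""
    groups = [[p] for p in sample]
    while len(groups) > k:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = math.dist(centroid(groups[i]), centroid(groups[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = groups.pop(j)  # j > i, so index i is unaffected
        groups[i].extend(merged)
    return [centroid(g) for g in groups]


def assign(points, centers):
    """Step 3: put each point in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)),
                      key=lambda c: math.dist(p, centers[c]))
        clusters[nearest].append(p)
    return clusters


def k_means(points, k, sample_size=6, max_rounds=20, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centers = seed_centers(sample, k)
    for _ in range(max_rounds):  # Step 4: repeat step 3 as necessary
        clusters = assign(points, centers)
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # assignments have stabilized
            break
        centers = new_centers
    return clusters


# Hypothetical data: three well-separated groups of three points each.
DOCS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
        (20, 0), (20, 1), (21, 0)]
result = k_means(DOCS, k=3, sample_size=7)
print(result)
```

Seeding from a sample keeps the expensive pair-wise grouping small; the full collection is only touched by the cheaper assignment step.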
Category Labels. Advantages: Interpretable. Capture summary information. Describe multiple facets of content. Domain dependent, and so descriptive. Disadvantages: Do not scale well (for organizing documents). Domain dependent, so costly to acquire. May mismatch users’ interests. Marti Hearst UCB SIMS, Fall 98
What to do next? Visual interfaces. Add as a last step for Google. Must be standardized, as other search engines have been. Give people options on how to search.
Works Cited: “Lightweight Document Clustering.” Sholom M. Weiss, Brian F. White, and Chidanand V. Apte; IBM T.J. Watson Research Center. “SIMS 296a-3: UI Background.” Marti Hearst.
Any questions?