Document Clustering Matt Hughes.

Presentation transcript:

Document Clustering Matt Hughes

Document Clustering: What is it?
- The categorization, or clustering, of documents based on term frequency and other relevance measures
- Breaks huge linear result lists down into manageable sets
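
To make "based on term frequency" concrete, here is a minimal Python sketch; the tokenizer and the two toy documents are mine, not from the slides:

```python
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase word occurs in a document."""
    return Counter(text.lower().split())

# Two toy documents: overlapping term counts are the raw signal
# clustering algorithms use to decide that documents belong together.
doc_a = "penn state football penn state campus"
doc_b = "penn state campus tour"
print(term_frequencies(doc_a))  # Counter({'penn': 2, 'state': 2, 'football': 1, 'campus': 1})
print(term_frequencies(doc_b))  # Counter({'penn': 1, 'state': 1, 'campus': 1, 'tour': 1})
```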

You use document clusters all the time:
- Table of contents
- Yahoo (human-categorized; its search is not true document clustering)
- "More like this"
- "Suggest a term"

The Joke

DC is human-centered searching
- Accommodates poor search skills
- Makes the Web accessible to all people
- Represents the way we think; mirrors the brain (not hierarchical, but overlapping groupings)
- An answer to information overload

DC: Discover new patterns
- Visual representations let the user see the entire result set on one page
- See patterns between sets
- Applications: customer service (IBM), gene mapping, the stock market
- Domain independent; works with any topic
- You can also supply your own vocabulary for the data

DC: Decrease time-to-result
- A Google search for "Penn State" returns 1,400,000 results
- At a maximum of 100 results per page, that is 14,000 pages to see them all
- Document clustering filters duplicates and categorizes results, so you find what you need in two or three pages

The Cluster Hypothesis
"Closely associated documents tend to be relevant to the same requests." (van Rijsbergen, 1979)
"... I would claim that document clustering can lead to more effective retrieval than linear search [which] ignores the relationships that exist between documents." (van Rijsbergen, 1979)
(Marti Hearst, UCB SIMS, Fall '98)

"Berry-Picking" as an Information Seeking Strategy (Bates 90)
Standard IR model:
- The information need remains the same throughout the search session
- The goal is to produce a perfect set of relevant documents
Berry-picking model (the Web):
- The query is continually shifting
- Users may move through a variety of sources
- New information may yield new ideas and new directions
- The value of the search lies in the bits and pieces picked up along the way
(Marti Hearst, UCB SIMS, Fall '98)

A sketch of a searcher: "moving through many actions towards a general goal of satisfactory completion of research related to an information need." (after Bates 90)
[Diagram: a winding path through successive queries Q0, Q1, Q2, Q3, Q4, Q5]
(Marti Hearst, UCB SIMS, Fall '98)

Problems with Document Clustering
- Variability in the quality of results; can be improved by providing a vocabulary
- Not good at differentiating homogeneous collections
- Currently slower than linear, PageRank-like technologies

What has been done so far?
- Visual DC maps: 2D mapping, 3D mapping
- Traditional DC search engines
- Customer service
- Automated content creation (http://news.google.com)

2D DC Mapping: Webbrain.com

2D DC Mapping: Smartmoney.com

3D Clustering (image from Wise et al 95)

Traditional DC Search Engines http://www.vivisimo.com http://www.infonetware.com

Customer Service Mapping (IBM)

How does Document Clustering work?

Text Clustering
- Finds overall similarities among groups of documents
- Finds overall similarities among groups of tokens
- Picks out some themes and ignores others; e.g., filters out duplicate and redundant documents
(Marti Hearst, UCB SIMS, Fall '98)
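
A hedged sketch of one common way to weight those tokens before measuring similarity, TF-IDF; the slides do not name a specific weighting scheme, so this choice (and the toy documents) are assumptions:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into TF-IDF weighted term vectors (dicts)."""
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)                # term frequency within this document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = [d.lower().split() for d in [
    "clustering groups similar documents together",
    "search engines rank documents by relevance",
    "clustering the results of a search",
]]
for vec in tfidf_vectors(docs):
    print(vec)
```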

Document Clustering
Clustering is "the art of finding groups in data." (Kaufman and Rousseeuw)
[Scatter plot of documents plotted against Term 1 and Term 2, with nearby points forming groups]
(Marti Hearst, UCB SIMS, Fall '98)

Document/Document Matrix
(Marti Hearst, UCB SIMS, Fall '98)
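
A minimal sketch of what such a matrix holds: pairwise similarities between the documents' term vectors. Cosine similarity is assumed here since the slides do not specify the measure, and the toy vectors are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def doc_doc_matrix(vectors):
    """Symmetric matrix: entry [i][j] is the similarity of documents i and j."""
    return [[cosine(a, b) for b in vectors] for a in vectors]

vectors = [{"clustering": 1.1, "documents": 0.4},
           {"search": 0.9, "documents": 0.4},
           {"clustering": 1.1, "search": 0.9}]
for row in doc_doc_matrix(vectors):
    print([round(x, 2) for x in row])
```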

Agglomerative Clustering
[Dendrogram over documents A through I, built bottom-up by repeatedly merging the closest pair]
(Marti Hearst, UCB SIMS, Fall '98)
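
A hedged sketch of the bottom-up procedure the figure illustrates, run over a document/document similarity matrix like the one above; the single-link merge rule is my own choice, since the slides do not specify one:

```python
def agglomerate(sim, k):
    """Merge clusters bottom-up until only k remain.

    sim -- symmetric document/document similarity matrix
    k   -- desired number of clusters
    Returns a list of clusters, each a list of document indices.
    """
    clusters = [[i] for i in range(len(sim))]

    def link(a, b):
        # Single-link: similarity of the closest pair across the two clusters.
        return max(sim[i][j] for i in a for j in b)

    while len(clusters) > k:
        # Find the most similar pair of clusters and merge them.
        a, b = max(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda pair: link(clusters[pair[0]], clusters[pair[1]]))
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

sim = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
print(agglomerate(sim, 2))  # [[0, 1], [2]] -- documents 0 and 1 merge first
```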

K-Means Clustering
1. Create a pairwise similarity measure
2. Find K centers using agglomerative clustering: take a small sample and group bottom-up until K groups are found
3. Assign each document to the nearest center, forming new clusters
4. Repeat step 3 as necessary
(Marti Hearst, UCB SIMS, Fall '98)
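
A minimal sketch of those four steps, reusing `cosine` and `agglomerate` from the earlier sketches; representing centers as averaged term vectors is my own simplification, not something the slides prescribe:

```python
import random

def centroid(vectors):
    """Average a group of sparse term vectors into one center vector."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}

def cluster_documents(vectors, k, iters=5):
    # Step 1: the pairwise similarity measure is cosine (defined earlier).
    # Step 2: find K centers by agglomerating a small random sample bottom-up.
    sample = random.sample(vectors, min(len(vectors), 3 * k))
    sim = [[cosine(a, b) for b in sample] for a in sample]
    centers = [centroid([sample[i] for i in grp]) for grp in agglomerate(sim, k)]
    for _ in range(iters):
        # Step 3: assign each document to its nearest (most similar) center.
        groups = [[] for _ in centers]
        for v in vectors:
            best = max(range(len(centers)), key=lambda c: cosine(v, centers[c]))
            groups[best].append(v)
        # Step 4: recompute the centers and repeat as necessary.
        centers = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
    return groups
```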

Category Labels
Advantages:
- Interpretable
- Capture summary information
- Describe multiple facets of content
- Domain dependent, and so descriptive
Disadvantages:
- Do not scale well (for organizing documents)
- Domain dependent, so costly to acquire
- May mismatch users' interests
(Marti Hearst, UCB SIMS, Fall '98)

What to do next?
- Visual interfaces must be standardized, as other search-engine conventions have been
- Add clustering as a last step for Google
- Give people options on how to search

Works Cited
- Weiss, Sholom M., Brian F. White, and Chidanand V. Apte. "Lightweight Document Clustering." IBM T.J. Watson Research Center.
- Hearst, Marti. "SIMS 296a-3: UI Background."

Any questions?