Clustering (Search Engine Results) CSE 454. © 2000-2005 Etzioni & Weld

Clustering (Search Engine Results) CSE 454

To Do: lecture is short; add k-means; add details of suffix-tree construction

To improve presentation: details on STC creation algorithm; details on ranking base nodes; something on k-means; something on Sue Dumais's approach. This was a pretty short class, as is.

Clustering Outline Motivation Document Clustering Offline evaluation Grouper I Grouper II Evaluation of deployed systems

Low Quality of Web Searches System perspective: –small coverage of Web (<16%) –dead links and out-of-date pages –limited resources IR perspective (relevancy of doc ~ similarity to query): –very short queries –huge database –novice users

Document Clustering User receives many documents from Web search engine Group documents in clusters –by topic Present clusters as interface

Grouper


Desiderata Coherent clusters Speed Browsable clusters –Naming

Main Questions Is document clustering feasible for Web search engines? Will the use of phrases help in achieving high quality clusters? Can phrase-based clustering be done quickly?

1. Clustering: group together similar items (words or documents)

Clustering Algorithms Hierarchical Agglomerative Clustering –O(n²) Linear-time algorithms –K-means (Rocchio, 66) –Single-Pass (Hill, 68) –Fractionation (Cutting et al, 92) –Buckshot (Cutting et al, 92)
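The to-do note above asks for k-means detail; as an illustrative sketch (not from the original slides), the linear-time assignment/update loop can be written as follows. The names `kmeans`, `iters`, and `seed` are hypothetical:

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means sketch: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # initial centroids = k random points
    for _ in range(iters):
        # Assignment: each vector joins its nearest centroid (squared Euclidean).
        clusters = [[] for _ in range(k)]
        for v in vectors:
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(v)
        # Update: move each non-empty cluster's centroid to its member mean.
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                centroids[i] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return clusters
```

Each iteration is linear in the number of vectors (for fixed k), which is why the slide lists k-means among the linear-time algorithms.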

Basic Concepts - 1 Hierarchical vs. Flat

Basic Concepts - 2 hard clustering: –each item in only one cluster soft clustering: –each item has a probability of membership in each cluster disjunctive / overlapping clustering: –an item can be in more than one cluster

Basic Concepts - 3 distance / similarity function (for documents) –dot product of vectors –number of common terms –co-citations –access statistics –share common phrases
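Two of the listed similarity options, dot product and common-term count, are simple enough to sketch directly. Documents here are assumed to be term-frequency dicts; the function names are illustrative:

```python
def dot_product(d1, d2):
    """Dot product of two documents in term-frequency dict form."""
    return sum(tf * d2.get(term, 0) for term, tf in d1.items())

def common_terms(d1, d2):
    """Number of distinct terms the two documents share."""
    return len(set(d1) & set(d2))
```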

Basic Concepts - 4 What is the “right” number of clusters? –a priori knowledge –default value: “5” –clusters up to 20% of collection size –choose best based on external criteria –Minimum Description Length –Global Quality Function no good answer

Hierarchical Clustering Agglomerative –bottom-up Initialize: each item is a cluster Iterate: select the two most similar clusters and merge them Halt: when required # of clusters remain

Hierarchical Clustering Divisive –top-down Initialize: all items in one cluster Iterate: select a cluster (least coherent) and divide it into two clusters Halt: when required # of clusters remain
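The Initialize/Iterate/Halt recipe for the agglomerative case can be sketched as follows. This is an illustrative, unoptimized implementation (the slides note HAC is O(n²) with proper bookkeeping; this naive version recomputes pairs each round); `hac` and `sim` are hypothetical names, and single-link cluster similarity is used:

```python
def hac(items, sim, num_clusters):
    """Bottom-up HAC: start with singletons, repeatedly merge the two
    most similar clusters, halt at the required number of clusters."""
    clusters = [[item] for item in items]  # Initialize: each item a cluster

    def cluster_sim(a, b):
        # Single link: similarity of the two most similar members.
        return max(sim(x, y) for x in a for y in b)

    while len(clusters) > num_clusters:   # Halt condition
        # Iterate: find and merge the most similar pair of clusters.
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters
```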

HAC Similarity Measures Single link Complete link Group average Ward’s method

Single Link cluster similarity = similarity of two most similar members

Single Link O(n²) prone to chaining bottom line: –simple, fast –often low quality

Complete Link cluster similarity = similarity of two least similar members

Complete Link worst case O(n³) fast algo requires O(n²) space no chaining bottom line: –typically much faster than O(n³) –often good quality

Group Average cluster similarity = average similarity of all pairs
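The three linkage rules on the preceding slides differ only in how member similarities are aggregated; a minimal sketch, assuming `sim` is an item-level similarity function and `a`, `b` are clusters (lists of items):

```python
def single_link(a, b, sim):
    # Similarity of the two MOST similar cross-cluster members.
    return max(sim(x, y) for x in a for y in b)

def complete_link(a, b, sim):
    # Similarity of the two LEAST similar cross-cluster members.
    return min(sim(x, y) for x in a for y in b)

def group_average(a, b, sim):
    # Average similarity over all cross-cluster pairs.
    return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))
```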

HAC Often Poor Results - Why? Often produces single large cluster Works best for: –spherical clusters; equal size; few outliers Text documents: –no model –not spherical; not equal size; overlap Web: –many outliers; lots of noise

Example: Clusters of Varied Sizes k-means; complete-link; group-average: single-link: chaining, but succeeds on this example

Example - Outliers HAC:

Suffix Tree Clustering (KDD’97; SIGIR’98) Most clustering algorithms aren’t specialized for text: they model a document as a set of words STC: document = sequence of words

STC Characteristics Coherent –phrase-based –overlapping clusters Speed and Scalability –linear time; incremental Browsable clusters –phrase-based –simple cluster definition

STC - Central Idea Identify base clusters –a group of documents that share a phrase –use a suffix tree Merge base clusters as needed

STC - Outline Three logical steps: “Clean” documents Use a suffix tree to identify base clusters - a group of documents that share a phrase Merge base clusters to form clusters

Step 1 - Document “Cleaning” Identify sentence boundaries Remove –HTML tags, –JavaScript, –Numbers, –Punctuation
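A rough sketch of this cleaning step using regular expressions (the exact rules Grouper used are not given here, so this is only an approximation; `clean` is a hypothetical name):

```python
import re

def clean(html):
    """Strip scripts/tags, split sentences, drop numbers and punctuation."""
    text = re.sub(r"<script.*?</script>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    sentences = re.split(r"[.!?]+", text)       # sentence boundaries
    cleaned = []
    for s in sentences:
        s = re.sub(r"[0-9]+", " ", s)           # remove numbers
        s = re.sub(r"[^\w\s]", " ", s)          # remove punctuation
        s = " ".join(s.split()).lower()         # normalize whitespace/case
        if s:
            cleaned.append(s)
    return cleaned
```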

Suffix Tree (Weiner, 73; Ukkonen, 95; Gusfield, 97) Example - suffix tree of the string: (1) "cats eat cheese"

Example - suffix tree of the strings: (1) "cats eat cheese", (2) "mice eat cheese too" and (3) "cats eat mice too"

Step 2 - Identify Base Clusters via Suffix Tree Build one suffix tree from all sentences of all documents Suffix tree node = base cluster Score all nodes Traverse tree and collect top k (500) base clusters
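A real implementation builds a generalized suffix tree in linear time; purely for illustration, the same base clusters can be found by enumerating shared word phrases directly and scoring each as s(B) = |B| · f(|P|), where |B| is the cluster's document count, |P| the phrase length, and f penalizes single-word phrases (the exact penalty function here is an assumption, as are the names):

```python
from collections import defaultdict

def base_clusters(docs, min_docs=2):
    """Base clusters: phrases (contiguous word runs) shared by >= min_docs
    documents, scored by s(B) = |B| * f(|P|).  Illustrative only: a real
    STC implementation walks a generalized suffix tree instead."""
    phrase_to_docs = defaultdict(set)
    for doc_id, sentence in enumerate(docs):
        words = sentence.split()
        for start in range(len(words)):            # every suffix...
            for end in range(start + 1, len(words) + 1):  # ...and each prefix of it
                phrase_to_docs[tuple(words[start:end])].add(doc_id)

    def f(phrase_len):
        # Assumed penalty: discount single words, reward longer phrases up to a cap.
        return 0.5 if phrase_len == 1 else min(phrase_len, 6)

    clusters = [(phrase, doc_ids, len(doc_ids) * f(len(phrase)))
                for phrase, doc_ids in phrase_to_docs.items()
                if len(doc_ids) >= min_docs]
    return sorted(clusters, key=lambda c: -c[2])   # highest score first
```

Run on the three example sentences from the suffix-tree slide, the top base clusters are the shared two-word phrases "cats eat" and "eat cheese".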

Step 3 - Merging Base Clusters Motivation: similar documents share multiple phrases Merge base clusters based on the overlap of their document sets Example (query: “salsa”) “tabasco sauce” docs: 3,4,5,6 “hot pepper” docs: 1,3,5,6 “dance” docs: 1,2,7 “latin music” docs: 1,7,8
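Following the slide, base clusters merge when their document sets overlap strongly: STC connects two base clusters when the overlap exceeds half of each set, then takes connected components of that graph. A sketch using union-find (names illustrative):

```python
def merge_base_clusters(base, threshold=0.5):
    """Merge base clusters (document-id sets) whose overlap exceeds
    `threshold` of BOTH sets; connected components become final clusters."""
    n = len(base)
    parent = list(range(n))          # union-find over base-cluster indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            inter = len(base[i] & base[j])
            if inter > threshold * len(base[i]) and inter > threshold * len(base[j]):
                parent[find(i)] = find(j)   # connect the two base clusters

    merged = {}
    for i in range(n):
        merged.setdefault(find(i), set()).update(base[i])
    return list(merged.values())
```

On the "salsa" example above, "tabasco sauce" and "hot pepper" merge (overlap 3 of 4 docs each), and "dance" merges with "latin music" (overlap 2 of 3), yielding a food cluster and a music cluster.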

Average Precision - WSR-SNIP: 16% increase over k-means (not stat. sig.)

Average Precision - WSR-DOCS: 45% increase over k-means (stat. sig.)

Grouper II Dynamic Index: –Non-merged base clusters Multiple interfaces: –List, Clusters + Dynamic Index (key phrases) Hierarchical: –Interactive “Zoom In” feature –(similar to Scatter/Gather)


Evaluation - Log Analysis

Northern Light “Custom Folders” predefined topics in a manually developed hierarchy Classify documents into topics Display “dominant” topics in search results


Summary Post-retrieval clustering –to address low precision of Web searches STC –phrase-based; overlapping clusters; fast Offline evaluation –quality of STC –advantages of using phrases vs. n-grams, FS Deployed two systems on the Web –Log analysis: promising initial results