Web Document Clustering: A Feasibility Demonstration Oren Zamir and Oren Etzioni, SIGIR, 1998.

Slides:



Advertisements
Similar presentations
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 7: Scoring and results assembly.
Advertisements

CWS: A Comparative Web Search System Jian-Tao Sun, Xuanhui Wang, § Dou Shen Hua-Jun Zeng, Zheng Chen Microsoft Research Asia University of Illinois at.
Database management system (DBMS)  a DBMS allows users and other software to store and retrieve data in a structured way  controls the organization,
Query Chain Focused Summarization Tal Baumel, Rafi Cohen, Michael Elhadad Jan 2014.
Chapter 5: Introduction to Information Retrieval
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
Query Languages. Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
 Grouper: A Dynamic Clustering Interface to Web Search Results Fatih Çalı ş ır Tolga Çekiç Elif Dal Acar Erdinç /9.
Semantic Access to Data from the Web Raquel Trillo *, Laura Po +, Sergio Ilarri *, Sonia Bergamaschi + and E. Mena * 1st International Workshop on Interoperability.
Learning to Cluster Web Search Results SIGIR 04. ABSTRACT Organizing Web search results into clusters facilitates users quick browsing through search.
A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng City University of Hong Kong WWW 2007 Session: Similarity Search April.
Web Document Clustering: A Feasibility Demonstration Hui Han CSE dept. PSU 10/15/01.
Online Clustering of Web Search results
GROUPER: A DYNAMIC CLUSTERING INTERFACE TO WEB SEARCH RESULTS Erdem Sarıgil O ğ uz Yılmaz
Web search results clustering Web search results clustering is a version of document clustering, but… Billions of pages Constantly changing Data mainly.
Reading - Skimming Meeting 12 Matakuliah: G0794/Bahasa Inggris Tahun: 2007.
Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.
Modern Information Retrieval
IR Models: Structural Models
Aki Hecht Seminar in Databases (236826) January 2009
Query Languages: Patterns & Structures. Pattern Matching Pattern –a set of syntactic features that must occur in a text segment Types of patterns –Words:
Web Search – Summer Term 2006 VII. Selected Topics - The Hilltop Algorithm (c) Wolfgang Hürst, Albert-Ludwigs-University.
Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang National Central University
J. Chen, O. R. Zaiane and R. Goebel An Unsupervised Approach to Cluster Web Search Results based on Word Sense Communities.
1 MARG-DARSHAK: A Scrapbook on Web Search engines allow the users to enter keywords relating to a topic and retrieve information about internet sites (URLs)
Evaluation Information retrieval Web. Purposes of Evaluation System Performance Evaluation efficiency of data structures and methods operational profile.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Introduction n Keyword-based query answering considers that the documents are flat i.e., a word in the title has the same weight as a word in the body.
 Clustering of Web Documents Jinfeng Chen. Zhong Su, Qiang Yang, HongHiang Zhang, Xiaowei Xu and Yuhen Hu, Correlation- based Document Clustering using.
A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng WWW 07.
Web Document Clustering By Sang-Cheol Seok. 1.Introduction: Web document clustering? Why ? Two results for the same query ‘amazon’ Google : currently.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
Glasgow 02/02/04 NN k networks for content-based image retrieval Daniel Heesch.
Querying Structured Text in an XML Database By Xuemei Luo.
Laboratory for InterNet Computing CSCE 561 Social Media Projects Ryan Benton October 8, 2012.
南台科技大學 資訊工程系 A web page usage prediction scheme using sequence indexing and clustering techniques Adviser: Yu-Chiang Li Speaker: Gung-Shian Lin Date:2010/10/15.
A New Suffix Tree Similarity Measure for Document Clustering
Improving Suffix Tree Clustering Base cluster ranking s(B) = |B| * f(|P|) |B| is the number of documents in base cluster B |P| is the number of words in.
SCATTER/GATHER : A CLUSTER BASED APPROACH FOR BROWSING LARGE DOCUMENT COLLECTIONS GROUPER : A DYNAMIC CLUSTERING INTERFACE TO WEB SEARCH RESULTS MINAL.
Semantic, Hierarchical, Online Clustering of Web Search Results Yisheng Dong.
Prepared by: Mahmoud Rafeek Al-Farra College of Science & Technology Dep. Of Computer Science & IT BCs of Information Technology Data Mining
Keyword Searching and Browsing in Databases using BANKS Seoyoung Ahn Mar 3, 2005 The University of Texas at Arlington.
Chapter 6: Information Retrieval and Web Search
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Information Retrieval Model Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
IT-522: Web Databases And Information Retrieval By Dr. Syed Noman Hasany.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
LOGO A comparison of two web-based document management systems ShaoxinYu Columbia University March 31, 2009.
DOCUMENT UPDATE SUMMARIZATION USING INCREMENTAL HIERARCHICAL CLUSTERING CIKM’10 (DINGDING WANG, TAO LI) Advisor: Koh, Jia-Ling Presenter: Nonhlanhla Shongwe.
Introduction to Information Retrieval Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
Information Retrieval
2015/12/251 Hierarchical Document Clustering Using Frequent Itemsets Benjamin C.M. Fung, Ke Wangy and Martin Ester Proceeding of International Conference.
26/01/20161Gianluca Demartini Ranking Categories for Faceted Search Gianluca Demartini L3S Research Seminars Hannover, 09 June 2006.
Compact Query Term Selection Using Topically Related Text Date : 2013/10/09 Source : SIGIR’13 Authors : K. Tamsin Maxwell, W. Bruce Croft Advisor : Dr.Jia-ling,
CS798: Information Retrieval Charlie Clarke Information retrieval is concerned with representing, searching, and manipulating.
GROUPER: A DYNAMIC CLUSTERING INTERFACE TO WEB SEARCH RESULTS by Oren Zamir and Oren Etzioni Presented by: Duygu Sarıkaya,Ahsen Yergök,Dilek Demirbaş.
Date: 2013/9/25 Author: Mikhail Ageev, Dmitry Lagun, Eugene Agichtein Source: SIGIR’13 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Improving Search Result.
Clustering (Search Engine Results) CSE 454. © Etzioni & Weld To Do Lecture is short Add k-means Details of ST construction.
1 Unsupervised Learning from URL Corpora Deepak P*, IBM Research, Bangalore Deepak Khemani, Dept. of CS&E, IIT Madras *Work done while at IIT Madras.
Knowledge and Information Retrieval Dr Nicholas Gibbins 32/4037.
1 Query Directed Web Page Clustering Daniel Crabtree Peter Andreae, Xiaoying Gao Victoria University of Wellington.
Information Retrieval in Practice
Clustering of Web pages
Data Mining K-means Algorithm
Mining Anchor Text for Query Refinement
Recuperação de Informação B
Information Retrieval and Web Design
Presentation transcript:

Web Document Clustering: A Feasibility Demonstration Oren Zamir and Oren Etzioni, SIGIR, 1998

Introduction Clustering web search results Snippets vs. web pages “fast” algorithm Post-retrieval document browsing technique 2Talk on: "Web Document Clustering: A Feasibility Demonstration"

Suffix Tree Clustering(STC) STC is a linear time clustering algorithm that is based on a suffix tree which efficiently identifies sets of documents that share common phrases. STC satisfies the key requirements: STC treats a document as a string, making use of proximity information between words. STC is novel, incremental, and O(n) time algorithm. STC succinctly summarizes clusters’ contents for users. smaller set Quick because of working on smaller set of documents, incremantality …

Operating procedure of STC Step1: Document “cleaning” Html -> plain text Words stemming Mark sentence boundaries Remove non-word tokens Step 2: Identifying Base Clusters Step3: Combining Base Clusters

Step2: Identifying base Clusters — Suffix Tree * STC treats a document as a set of strings… Suffix tree of string S: a compact tree containing all the suffixes of S String1: “cat ate cheese”, String2: “mouse ate cheese too” String3: “cat ate mouse too”

Base cluster 6 Base cluster scoring: s(B) = |B| ⋅ f(|P|) Talk on: "Web Document Clustering: A Feasibility Demonstration"

Base clusters Base clusters corresponding to the suffix tree nodes

Cluster score s(B) = |B| * f(|P|) |B| is the number of documents in base cluster B |P| is the number of words in P that have a non- zero score zero score words: stopwords, too few( 40%)

Step 3:Combining Base Clusters Merge base clusters with a high overlap in their document sets documents may share multiple phrases. Similarity of B m and B n (0.5 is paramter) 1 iff | B m  B n | / | B m | > 0.5 =and | B m  B n | / | B n | > otherwise

Base Cluster Graph Node: cluster Edge: similarity between two clusters > 1 What if “ate” is in the stop word list?

STC is Incremental As each document arrives from the web, we “clean” it (linear with collection size) Add it to the suffix tree. Each node that is updated/created as a result of this is tagged(linear) Update the relevant base clusters and recalculate the similarity of these base clusters to the rest of k highest scoring base clusters(linear) Check any changes to the final clusters(linear) Score and sort the final clusters, choose top 10...(linear)

STC allows cluster overlap… Why overlap is reasonable? Why overlap is reasonable? a document often has 1+ topics a document often has 1+ topics STC allows a document to appear in 1+ clusters, since documents may share 1+ phrases with other documents STC allows a document to appear in 1+ clusters, since documents may share 1+ phrases with other documents But not too similar to be merged into one cluster.. But not too similar to be merged into one cluster..

Output 13Talk on: "Web Document Clustering: A Feasibility Demonstration"

Evaluation 10 queries with 200 documents/snippets each MetaCrawler Manually assigned relevance judgment  40 relevant document for each query Comparison to other clustering algorithms STC is superior Alteration of STC and the other algorithms Phrases and overlapping important for STC 14Talk on: "Web Document Clustering: A Feasibility Demonstration"

Properties of STC Relevance Group relevant docs separately from irrelevant ones Browsable Summaries Concise and accurate descriptions of clusters Overlap Documents can be in multiple clusters Snippet-tolerance High quality from snippets Speed “cluster up to 1,000 docs in a few seconds” Incrementality Start to cluster while receiving docs 15Talk on: "Web Document Clustering: A Feasibility Demonstration"

16 Questions… Talk on: "Web Document Clustering: A Feasibility Demonstration"