Web Document Clustering By Sang-Cheol Seok

1. Introduction
- Web document clustering: what is it, and why?
- Two results for the same query ‘amazon’:
  - Google: currently the most powerful search engine
  - Metacrawler: a search engine that clusters the retrieved Web documents

2. Approaches
(1) Using contents of documents
(2) Using user’s usage logs
(3) Using current search engines
(4) Using hyperlinks
(5) Other classical methods

(1) Using Contents of Documents
- Clusters are created from the snippets returned by Web search engines.
- Clusters based on snippets are almost as good as clusters created using the full text of the Web documents.
- Suffix Tree Clustering (STC): an incremental, O(n)-time algorithm with three logical steps:
  (1) document “cleaning”,
  (2) identifying base clusters using a suffix tree, and
  (3) combining these base clusters into clusters.
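The three STC steps can be illustrated with a small sketch. It is not the original implementation: word n-grams of bounded length stand in for the suffix tree (so it is not O(n)), and the stopword list, the `max_len` limit, and the 0.5 overlap threshold for merging are assumptions chosen for illustration. All function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a tiny illustrative stopword list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "on", "is"}

def clean(snippet):
    """Step 1: lowercase the snippet and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", snippet.lower())

def base_clusters(snippets, max_len=3):
    """Step 2: map each phrase shared by at least two snippets to its documents.

    Word n-grams up to max_len replace the suffix tree in this sketch.
    """
    phrase_docs = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        words = clean(snippet)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase = tuple(words[i:i + n])
                if all(w in STOPWORDS for w in phrase):
                    continue  # ignore phrases made only of stopwords
                phrase_docs[phrase].add(doc_id)
    return {p: docs for p, docs in phrase_docs.items() if len(docs) >= 2}

def merge_clusters(base, threshold=0.5):
    """Step 3: merge base clusters whose document sets overlap heavily."""
    phrases = list(base)
    parent = {p: p for p in phrases}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for i, a in enumerate(phrases):
        for b in phrases[i + 1:]:
            common = len(base[a] & base[b])
            # Link two base clusters only if the overlap is large in both directions.
            if common / len(base[a]) > threshold and common / len(base[b]) > threshold:
                parent[find(a)] = find(b)

    merged = defaultdict(set)
    for p in phrases:
        merged[find(p)] |= base[p]
    return sorted(sorted(docs) for docs in merged.values())
```

For four made-up snippets about ‘amazon’ (two about the rainforest, two about the bookstore), this sketch yields one cluster per sense plus a cluster for the shared word, so a document may appear in more than one cluster.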

(2) Using user’s usage logs
- Advantage: relevancy information is objectively reflected by the usage logs.
- An experimental result on
  Cluster 1: /shuttle/missions/41-c/news, /shuttle/missions/61-b, …
  Cluster 2: /history/apollo/sa-2/news/, /history/apollo/sa-2/images, …
  Cluster 3: /software/winvn/userguide/3_3_2.htm, /software/winvn/userguide/3_3_4.htm, …
  …

(3) Using current web search engines: Metacrawler
- Step 1: When MetaCrawler receives a query, it posts the query to multiple search engines in parallel.
- Step 2: It performs sophisticated pruning on the responses returned, discarding 75% of them as irrelevant, outdated, or unavailable.
- Metacrawler at U. of Washington.
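A minimal sketch of the two steps, under stated assumptions: each search engine is modeled as a hypothetical Python callable that returns `(url, score)` pairs, `is_live` is a hypothetical availability check, and the `min_score` cutoff is an illustrative stand-in for MetaCrawler's actual pruning rules, which the slides do not describe.

```python
from concurrent.futures import ThreadPoolExecutor

def metasearch(query, engines, is_live, min_score=0.5):
    """Query several engines in parallel, then collate and prune the results.

    engines: list of callables, each mapping a query to [(url, score), ...].
    is_live: callable returning True if a URL is still reachable (assumed).
    """
    # Step 1: post the query to every engine in parallel.
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))

    # Collate: keep the best score seen for each URL, which removes duplicates.
    best = {}
    for results in result_lists:
        for url, score in results:
            best[url] = max(score, best.get(url, 0.0))

    # Step 2: prune results that score too low or are no longer available.
    kept = [(url, s) for url, s in best.items() if s >= min_score and is_live(url)]
    return sorted(kept, key=lambda item: (-item[1], item[0]))
```

In practice each callable would wrap a network request, which is why the calls are dispatched to a thread pool rather than made one after another.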

(4) Using hyperlinks
- Consider Web documents as vertices and hyperlinks as directed edges in a directed graph.
- A similarity-based clustering method was successfully used in image segmentation.
- Kleinberg’s HITS algorithm is based purely on hyperlink information:
  - it finds authority and hub documents for a user query;
  - it covers only the most popular topics and leaves out the less popular ones.
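The hub and authority idea behind HITS can be sketched as a short power iteration over the link graph. This is a simplified illustration: the edge-list input, the fixed iteration count, and the function name are assumptions, and the step that builds the query-specific subgraph is omitted.

```python
import math

def hits(edges, iterations=50):
    """Return (hub, authority) scores for a directed graph given as (src, dst) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # A page's hub score is the sum of the authority scores of pages it links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = new_hub
        # Normalize both score vectors to unit length so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth
```

Pages with many incoming links from good hubs end up with the highest authority, which is consistent with the remark that only the most popular topics are covered.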

(4) Using Hyperlinks: continued
- Cluster Web documents based on both the textual and the hyperlink information.
- The hyperlink structure is used as the dominant factor in the similarity metric.

(5) Other classical clustering methods
- K-means method
- HAC (hierarchical agglomerative clustering)
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- Single-link, group-average, and complete-link methods, single-pass methods, and Buckshot and Fractionation have also been used.
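As an illustration of the first method in this list, here is a minimal k-means sketch. It assumes documents have already been turned into numeric term vectors (tuples of floats), seeds the centroids with the first k points, and runs a fixed number of iterations; these choices are simplifications, not part of the presentation.

```python
def kmeans(points, k, iterations=20):
    """Cluster points (equal-length tuples of numbers) into k groups.

    Returns one cluster label per point.
    """
    # Assumption: the first k points serve as the initial centroids.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels
```

Unlike STC, this assigns every document to exactly one cluster, so the resulting clusters cannot overlap.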

3. Key requirements and future challenges
(1) Key requirements for Web document clustering methods
- Relevance
- Browsable Summaries
- Overlap
- Speed
- Incrementality for some methods

3. Key requirements and future challenges: continued
(2) Concerns about current methods
- Each method has pros and cons.
- Using hyperlinks: the best accuracy, with some room left to improve, but its clusters do not overlap.
- STC: best for browsing and for incrementality.
- Metacrawler: best at pruning.

3. Key requirements and future challenges: continued
(3) Future challenges
- We cannot take advantage of all the pros of every method at once: some pros work against others, so we have to make trade-offs.
- Moreover, we still need to find improvements to the current methods.