Presentation transcript:

Web search results clustering
Web search results clustering is a version of document clustering, but…
– Billions of pages
– Constantly changing
– Data mainly unstructured and heterogeneous
– Additional information to consider (e.g. links, click-through data)

Some requirements
– Fast
  – Immediate response to query
– Flexible
  – Web content changes constantly
  – Overlapping
– User-oriented
  – Main goal is to aid the user in finding information
  – Meaningful labels
  – Visualization: GUI

Architecture of Web clustering engine

Main issues
– Online or offline clustering?
– What to use as input
  – Entire documents
  – Snippets
  – Structure information (links)
  – Other data (e.g. click-through)
  – Use stop word lists, stemming, etc.
– How to define similarity?
  – Content (e.g. vector-space model)
  – Link analysis
  – Usage statistics
– How to group similar documents?
– How to label the groups?
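The content-based similarity option above can be illustrated with a minimal sketch. It assumes raw term-frequency vectors, a tiny hand-picked stop word list, and no stemming; the names `STOP_WORDS`, `term_vector`, and `cosine_similarity` are hypothetical and not taken from any of the systems discussed in these slides.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop word list, not a standard one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "on"}


def term_vector(snippet):
    """Turn a snippet into a bag-of-words term-frequency vector."""
    tokens = re.findall(r"[a-z0-9]+", snippet.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty snippet is similar to nothing
    return dot / (norm_u * norm_v)


cat = term_vector("Jaguar is a big cat")
car = term_vector("The Jaguar car")
similarity = cosine_similarity(cat, car)
```

Snippets are short, so such vectors are very sparse; that is one reason real engines may add link or usage information on top of content similarity.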

Clustering algorithms
– Flat or hierarchical?
– Overlapping?
– Hard or soft? (One object to one cluster or multiple clusters)
– Incremental?
– Predefined cluster number?
– Requiring explicit similarity measure? Distance measure?

Clustering algorithms
– Distance-based
  – Hierarchical
    – Agglomerative Hierarchical Clustering (AHC)
  – Flat
    – K-means (can be fuzzy)
    – Single-pass (incremental)
– Other
  – Suffix Tree Clustering (Grouper)
  – Query directed clustering
  – Self-organizing (Kohonen) maps (neural networks)
  – Latent Semantic Indexing (LSI) (reducing the dimensionality of the vector-space)
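As a rough illustration of the single-pass (incremental) entry above, the sketch below assigns each document, in arrival order, to the most similar existing cluster or opens a new one. The cosine measure, whitespace tokenization, summed-count centroids, and the `threshold` value are all assumptions made for the example, not details from the slides.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def single_pass(docs, threshold=0.5):
    """Cluster docs in one pass; returns lists of document indices."""
    centroids = []  # summed term counts per cluster (cosine ignores scale)
    clusters = []   # document indices per cluster
    for i, doc in enumerate(docs):
        vec = Counter(doc.lower().split())
        best, best_sim = None, threshold
        for k, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = k, sim
        if best is None:
            # No cluster is similar enough: start a new one.
            centroids.append(vec)
            clusters.append([i])
        else:
            centroids[best].update(vec)
            clusters[best].append(i)
    return clusters


docs = [
    "jaguar car dealer",
    "jaguar car price",
    "big cat habitat",
    "big cat jaguar habitat",
]
clusters = single_pass(docs)
```

The result depends on the order in which documents arrive and on the threshold, which is the usual trade-off for not needing the number of clusters in advance.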

Clustering algorithms
– Data-centric algorithms
  – E.g. Scatter/Gather
  – Vector space model, k-means or HAC
  – Cluster labels are not good enough
– Description-aware algorithms
  – STC
– Description-centric algorithms
  – E.g. Vivisimo, Lingo, SRC
  – Good labels

Research Prototype systems

Commercial systems

Scatter/Gather, from Lin and Pantel (2002) and Hearst and Pedersen (1996)

KeySRC

KartOO

Others More at chnology.shtml

Scatter/Gather (Cutting et al. 1992)
– Designed for browsing
– Based on two novel clustering algorithms
  – Buckshot: fast, for online clustering
  – Fractionation: accurate, for offline initial clustering of the entire set
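The Buckshot idea can be sketched as follows, under the common description of the algorithm: cluster a random sample of about √(kn) documents agglomeratively to obtain k centers, then assign every document to its nearest center. This is a simplified illustration, not the authors' implementation: the merge criterion (cosine between summed term vectors), the single assignment pass, and the names `hac` and `buckshot` are assumptions. It also assumes at least k documents.

```python
import math
import random
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hac(vectors, k):
    """Merge the two most similar clusters until only k remain."""
    clusters = [Counter(v) for v in vectors]  # copies, inputs stay intact
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i], clusters[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i].update(clusters[j])  # summed vector acts as the center
        del clusters[j]
    return clusters


def buckshot(vectors, k, seed=0):
    """Return a cluster index for every vector."""
    n = len(vectors)
    sample_size = min(n, max(k, math.ceil(math.sqrt(k * n))))
    sample = random.Random(seed).sample(vectors, sample_size)
    centers = hac(sample, k)  # quadratic, but only in the sample size
    return [
        max(range(len(centers)), key=lambda c: cosine(v, centers[c]))
        for v in vectors
    ]


texts = [
    "jaguar car dealer", "jaguar car price", "jaguar car review",
    "python snake habitat", "python snake venom", "python snake species",
]
vectors = [Counter(t.split()) for t in texts]
assignment = buckshot(vectors, k=2)
```

Because the quadratic step runs on only about √(kn) items, the overall cost stays roughly linear in n for fixed k, which is what makes the approach usable online.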

Grouper (Zamir and Etzioni 1997, 1999)
– Online
– Operates on query result snippets
– Clusters together documents with large common subphrases
– Suffix Tree Clustering (STC): linear, incremental, overlapping, can be extended to hierarchical
– STC induces labeling

Iwona Białynicka-Birula - Clustering Web Search Results
STC algorithm
– Step 1: Cleaning
  – Stemming
  – Sentence boundary identification
  – Punctuation elimination
– Step 2: Suffix tree construction
  – Produces base clusters (internal nodes)
  – Base clusters are scored based on size and phrase score (which depends on length and word "quality")
– Step 3: Merging base clusters
  – Highly overlapping clusters are merged
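Steps 2 and 3 can be approximated with the sketch below. It deliberately replaces the suffix tree with plain word n-gram enumeration (simpler, but not linear time), skips cleaning and scoring, and merges base clusters when each shares more than half of its documents with the other, which is the overlap rule usually attributed to STC. The function names, `max_len`, and the `overlap` default are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def base_clusters(docs, max_len=3):
    """Map each phrase (word n-gram) shared by 2+ docs to its doc ids.

    Stand-in for suffix tree construction: same base clusters for
    phrases up to max_len words, without the linear-time guarantee.
    """
    phrase_docs = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        words = doc.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase_docs[tuple(words[i:i + n])].add(doc_id)
    return {p: ids for p, ids in phrase_docs.items() if len(ids) >= 2}


def merge_base_clusters(clusters, overlap=0.5):
    """Merge highly overlapping base clusters (connected components)."""
    phrases = list(clusters)
    parent = {p: p for p in phrases}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    for a, b in combinations(phrases, 2):
        common = len(clusters[a] & clusters[b])
        if (common / len(clusters[a]) > overlap
                and common / len(clusters[b]) > overlap):
            parent[find(a)] = find(b)

    merged = defaultdict(lambda: (set(), set()))
    for p in phrases:
        labels, members = merged[find(p)]
        labels.add(" ".join(p))
        members |= clusters[p]
    return list(merged.values())  # (label phrases, document ids) pairs


snippets = [
    "suffix tree clustering of snippets",
    "suffix tree clustering for search",
    "web search engine",
    "web search results",
]
merged = merge_base_clusters(base_clusters(snippets))
```

In this toy input the second snippet ends up in both merged clusters, which shows the overlapping behaviour, and the shared phrases double as candidate labels.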

Carrot2 (Stefanowski and Weiss 2003)
– Component framework
– Allows substituting components for
  – Input (e.g. snippets from other search engines)
  – Filter
    – Stemming
    – Distance measure
    – Clustering
  – Output

Vivísimo
– Commercial
– Online
– Hierarchical
– Conceptual

Other
– Mapuccino (IBM)
  – (Maarek et al. 2000)
  – Relatively efficient AHC (O(n^2))
  – Similarity based on vector-space model
– (Su et al. 2001)
  – Only usage statistics used as input
  – Recursive Density Based Clustering
– SHOC
  – (Zhang and Dong 2004)
  – Grouper-like
  – Key phrase discovery

Problems
– The performance is far from perfect
  – Incompleteness of clusters: it is hard to tell why some clusters are generated and others are missing
  – Different cluster granularity: some clusters are very specific, some are very broad
  – Inconsistency between cluster contents and labels; lack of intra- and inter-cluster consistency
  – Label expressiveness
  – Lack of evaluation and data sets

Future research trends
– Extract more powerful features: hyperlinks, external information, temporal attributes
– Generate more expressive or effective descriptions of clusters
– Improve the accuracy of the hierarchy structure
– Consider user characteristics and web usage data
– Integration with ontologies
– Better visualisation of the clusters
– Application to mobile search
– Clustering of XML documents