Improving Suffix Tree Clustering

Presentation transcript:

Improving Suffix Tree Clustering: base cluster ranking
s(B) = |B| * f(|P|)
– |B| is the number of documents in base cluster B
– |P| is the number of words in P that have a non-zero score
– Zero-score words: stopwords, and words that appear in too few documents or in too many (more than 40%)
– TF-IDF word weighting is better
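The ranking formula above can be sketched in a few lines of Python. This is only an illustration: the slide does not define f, so `phrase_length_factor` is a hypothetical stand-in (single words penalised, growth capped at six words), and the stopword list and the `min_docs` threshold are assumptions as well. Only the 40% upper bound comes from the slide.

```python
from typing import Dict, Sequence, Set

# Illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "of", "a", "and", "in", "to"}


def phrase_length_factor(effective_words: int) -> float:
    """Hypothetical f(|P|): zero for empty phrases, a penalty for single
    words, linear growth up to six words, constant afterwards."""
    if effective_words <= 0:
        return 0.0
    if effective_words == 1:
        return 0.5
    return float(min(effective_words, 6))


def effective_length(phrase: Sequence[str], doc_freq: Dict[str, int],
                     n_docs: int, min_docs: int = 3,
                     max_fraction: float = 0.4) -> int:
    """Count the words of the phrase that have a non-zero score.

    min_docs is an assumed lower bound; max_fraction is the 40% bound.
    """
    count = 0
    for word in phrase:
        df = doc_freq.get(word, 0)
        if word in STOPWORDS:
            continue  # stopwords score zero
        if df < min_docs or df > max_fraction * n_docs:
            continue  # too rare or too common: scores zero
        count += 1
    return count


def base_cluster_score(docs: Set[int], phrase: Sequence[str],
                       doc_freq: Dict[str, int], n_docs: int) -> float:
    """s(B) = |B| * f(|P|)."""
    return len(docs) * phrase_length_factor(
        effective_length(phrase, doc_freq, n_docs))
```

With 50 documents and document frequencies of 10 for "jaguar" and 8 for "car", the phrase "the jaguar car" has two effective words, so a base cluster of four documents scores 4 * 2 under this assumed f.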

Improving Suffix Tree Clustering: cluster similarity
– Page overlap
– Add: cluster label distance (word pair distance)
  – Google normalised distance
  – WikiMiner: wikilink similarity
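As a rough illustration of the label distance idea, the sketch below computes the normalised Google distance from hit counts and averages a word pair distance over two cluster labels. The function names, the averaging rule, and the source of the hit counts are assumptions made for this example; the slides do not say how the word pair distances are combined.

```python
import math
from typing import Callable, Sequence


def normalized_google_distance(fx: float, fy: float, fxy: float,
                               n: float) -> float:
    """NGD from hit counts: fx and fy are the counts for each word,
    fxy the count for both words together, n the total number of pages."""
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return float("inf")  # words never seen (together): maximally distant
    if n <= min(fx, fy):
        raise ValueError("n must exceed the smaller of the two word counts")
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


def label_distance(label_a: Sequence[str], label_b: Sequence[str],
                   pair_distance: Callable[[str, str], float]) -> float:
    """Assumed combination rule: mean distance over all word pairs
    drawn from the two cluster labels."""
    pairs = [(a, b) for a in label_a for b in label_b]
    if not pairs:
        return float("inf")
    return sum(pair_distance(a, b) for a, b in pairs) / len(pairs)
```

Two words that always occur together get a distance of 0, and the distance grows as their joint count falls relative to their individual counts.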

Improving Suffix Tree Clustering: third step, cluster merging
– If more than half of the pages overlap, then merge
– New: HAC (hierarchical agglomerative clustering)
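A minimal sketch of the overlap-based merging rule, assuming clusters are non-empty sets of page ids and that "more than half" is checked in both directions, with linked clusters then merged as connected components. The function names and the union-find bookkeeping are choices made for this example, not something the slides specify.

```python
from itertools import combinations
from typing import List, Set


def overlaps(a: Set[int], b: Set[int], threshold: float = 0.5) -> bool:
    """True if the shared pages are more than `threshold` of both clusters."""
    shared = len(a & b)
    return shared / len(a) > threshold and shared / len(b) > threshold


def merge_base_clusters(clusters: List[Set[int]],
                        threshold: float = 0.5) -> List[Set[int]]:
    """Merge every group of clusters linked by the overlap rule."""
    parent = list(range(len(clusters)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(clusters)), 2):
        if overlaps(clusters[i], clusters[j], threshold):
            parent[find(i)] = find(j)

    merged = {}
    for i, cluster in enumerate(clusters):
        merged.setdefault(find(i), set()).update(cluster)
    return list(merged.values())
```

Because linked clusters are merged transitively, {1, 2, 3}, {2, 3, 4} and {3, 4, 5} end up in one cluster even though the first and last share only one page.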

Query Directed Web Page Clustering
Daniel Crabtree, Peter Andreae, Xiaoying Gao
Victoria University of Wellington

Related Work: Web Page Clustering
All standard algorithms
– Partitioning (k-means), hierarchical (agglomerative, divisive), ...
Web features
– Structure, hyperlinks, colour
Textual features
– STC: phrases; Lingo: latent semantic indexing
Word semantics
– Global document analysis, co-occurrence statistics
The query is never used

QDC – Query Directed Clustering
1: Find Base Clusters
2: Merge Clusters
3: Split Clusters
4: Select Clusters
5: Clean Clusters

QDC – 1: Find Base Clusters
Steps: Clean Pages, Identify Base Clusters, Prune Small Clusters, Semantic Prune #1, Semantic Prune #2
Example base clusters (size): Mac (28), Car (40), Auto (25), Animal (18), OS (12), Atari (22), Game (5), Service (80), Forest (11)
[Figure: Score #1 and Score #2, formulas over cluster size and distance(cluster, query)]

QDC – 1: Query Distance
[Figure: query "Jaguar" with example terms Car, Home Page, Toyota; labels Specific, Broad, Ambiguous]

QDC – 1: Find Base Clusters
Removes many base clusters
– Normally a negative effect on performance
But ... query directed score
– Reliable guide to cluster quality
– Removes only low-quality clusters
– Improves performance

QDC – 2: Merge Clusters
[Figure: merging the base clusters Mac (28), Car (40), Auto (25), Animal (18), OS (12), Atari (22) gives the merged clusters "Car, Auto" (40) and "Mac, OS" (28)]

QDC – 2: Merge Clusters
Single-link clustering
Similarity function
– Extension (by page overlap)
– Intension (by description similarity)
  Global document analysis: co-occurrence frequency relative to the frequency expected if the words were independent
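The description similarity above rests on comparing observed co-occurrence with what independence would predict. A minimal sketch of that ratio, assuming each word is represented by the set of documents containing it (the function name and the zero return for unseen words are choices made here, not taken from the slides):

```python
from typing import Set


def cooccurrence_ratio(docs_with_a: Set[int], docs_with_b: Set[int],
                       n_docs: int) -> float:
    """Observed co-occurrence frequency divided by the frequency expected
    if the two words occurred independently. Values above 1 suggest the
    words are related; values below 1 suggest they avoid each other."""
    if n_docs <= 0 or not docs_with_a or not docs_with_b:
        return 0.0
    observed = len(docs_with_a & docs_with_b) / n_docs
    expected = (len(docs_with_a) / n_docs) * (len(docs_with_b) / n_docs)
    return observed / expected
```

For example, in a collection of 8 documents, two words that each appear in 2 documents and always together give a ratio of 4, while two words that each appear in half the documents and co-occur in a quarter of them give exactly 1.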

QDC – 2: Merge Clusters
Reducing the page overlap threshold
– Normally a negative effect on performance
But ... description similarity
– More semantically related clusters merge: increases cluster coverage
– Fewer semantically unrelated clusters merge: increases cluster quality

QDC – 3: Split Clusters
Single-link merging
– Cluster chaining (drifting)
Hierarchical agglomerative
– Distance measure: path length
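To make the path length measure concrete, here is a small sketch that treats the base clusters of one merged cluster as nodes in a graph, with an edge wherever two base clusters were judged similar during merging. That graph representation is an assumption for illustration; the slides only name the distance measure. A long path between two base clusters is the signature of chaining.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Optional


def path_length(graph: Dict[Hashable, Iterable[Hashable]],
                start: Hashable, goal: Hashable) -> Optional[int]:
    """Number of edges on the shortest path from start to goal,
    found by breadth-first search; None if goal is unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None
```

In a chain A, B, C where only neighbouring clusters are similar, A and C sit in the same single-link cluster but are two steps apart, which an agglomerative pass over these distances could use to split them again.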

QDC – 4: Select Clusters
ESTC cluster selection algorithm
– Heuristic-based hill-climbing search with look-ahead and advanced branch-and-bound pruning
Original heuristic
– Page coverage and cluster overlap
New heuristic
– Page coverage and cluster overlap
– Pages not covered and cluster quality

QDC – 5: Clean Clusters
Page-cluster relevance
– Based on base cluster membership
– Cluster size, cluster quality
Remove outliers and erroneous inclusions
Sorting improves usability

Evaluation
Algorithm efficiency on 250 documents
– Ten times faster than STC
– One hundred times faster than k-means
Algorithm performance
– External evaluation against a rich gold standard
Real-world usability
– Informal usability comparison with four algorithms: k-means, ESTC, Lingo, Vivisimo

Evaluation: Algorithm Performance
External evaluation against a rich gold standard
Four algorithms
– STC, ESTC, k-means, Random
Four data sets
– Salsa, Jaguar, GP, Victoria University
Eleven measurements
– Average and weighted: quality, coverage, precision, recall, and entropy, plus mutual information
Snippets and full page text
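For readers unfamiliar with external evaluation, the sketch below shows textbook versions of three of the measurements named above, computed for one cluster against gold-standard topics. These are the usual definitions and may differ in detail from the ones used in the evaluation reported here; the function names are illustrative.

```python
import math
from collections import Counter
from typing import Iterable, Set, Tuple


def precision_recall(cluster: Set[int], topic: Set[int]) -> Tuple[float, float]:
    """Precision: share of the cluster's pages that belong to the topic.
    Recall: share of the topic's pages that the cluster contains."""
    if not cluster or not topic:
        return 0.0, 0.0
    hits = len(cluster & topic)
    return hits / len(cluster), hits / len(topic)


def cluster_entropy(labels: Iterable[str]) -> float:
    """Entropy (in bits) of the gold-standard topic labels of the pages
    in one cluster; 0 means every page has the same topic."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A cluster split evenly between two topics has an entropy of 1 bit, while a pure cluster has an entropy of 0.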

Evaluation: Quality and Coverage

Evaluation: Improvement over Random

Evaluation: Precision and Recall

Evaluation: Entropy and Mutual Information

Evaluation: Real World Usability (Jaguar results)
QDC finds broader topics
– Maximizes the probability of refinement
– Simplifies the user's decision process
  – Fewer choices
  – Less chance of multiple relevant choices
  – Fewer semantically meaningless clusters

Evaluation: Real World Usability
Performance is better than the external evaluation indicates
– No penalty for overly specific clusters, since the gold standard included them
External evaluation shows that QDC clusters:
– Contain fewer irrelevant pages
– Cover more relevant pages

Conclusion
QDC: new web page clustering algorithm
Key innovations
– Query directed scoring
– Merging using cluster descriptions
– Solving cluster chaining by splitting
– Improved cluster selection heuristic
Vastly improved performance over other algorithms
– External evaluation
– Informal usability evaluation

Further Extension
Use phrases rather than just words
– STC and Lingo show that a large improvement is possible
Use wiki link similarity (WikiMiner) instead of GND (Google normalised distance)
Future work:
– Improve cluster description similarity merging to consider the entire description
– Common shared phrases as key features; use VSM, build vectors for each cluster, new weighting
– Formal usability evaluation