A New Suffix Tree Similarity Measure for Document Clustering

Presentation transcript:

A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng City University of Hong Kong WWW2007

INTRODUCTION
Goal: to develop a document clustering algorithm that categorizes the Web documents in an online community.
Vector Space Document (VSD) model: represents any document as a feature vector of the words it contains.
Suffix tree document model: identifies phrases that are common to groups of documents.

suffix sub-string

Suffix Tree Document Model
Example documents:
1. cat ate cheese
2. mouse ate cheese too
3. cat ate mouse too
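The three example documents can be indexed with a word-level suffix structure. The sketch below is a naive, uncompressed suffix trie rather than a compact suffix tree, and the names `Node`, `build_suffix_trie` and `find` are illustrative assumptions, not taken from the slides. Each node records how often each document traverses it, which is the information the later df and tf definitions rely on.

```python
from collections import defaultdict


class Node:
    """One node of a word-level suffix trie (illustrative, uncompressed)."""

    def __init__(self):
        self.children = {}          # word -> child Node
        self.tf = defaultdict(int)  # doc id -> number of traversals


def build_suffix_trie(docs):
    """Insert every word-level suffix of every document (naive O(m^2))."""
    root = Node()
    for doc_id, text in docs.items():
        words = text.split()
        for start in range(len(words)):
            node = root
            for word in words[start:]:
                node = node.children.setdefault(word, Node())
                node.tf[doc_id] += 1
    return root


def find(root, phrase):
    """Return the node reached by following the words of `phrase`."""
    node = root
    for word in phrase.split():
        node = node.children[word]
    return node


docs = {1: "cat ate cheese", 2: "mouse ate cheese too", 3: "cat ate mouse too"}
root = build_suffix_trie(docs)
ate = find(root, "ate")
# Number of documents traversing the node, and traversals by document 1.
print(len(ate.tf), ate.tf.get(1, 0))
```

A compact suffix tree would collapse chains of single-child nodes into one edge; the trie is used here only because it is short enough to read in one pass.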

STC Algorithm (Suffix Tree Clustering)
1. Common suffix tree generation.
2. Base cluster selection. Each base cluster B is assigned a score s(B) = |B| * f(|P|), where |B| is the number of documents in B and |P| is the number of words in the phrase P.
3. Cluster merging, using the Jaccard coefficient.
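The merging step can be sketched as follows. This is a minimal illustration that assumes base clusters are given as sets of document ids, that two base clusters are linked when their Jaccard coefficient exceeds a threshold, and that final clusters are the connected components of the resulting graph. The threshold of 0.5 and the function names are assumptions, not values stated in the slides.

```python
def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two document-id sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_base_clusters(base_clusters, threshold=0.5):
    """Merge base clusters whose Jaccard coefficient exceeds `threshold`.

    Linked base clusters form a graph; each connected component
    becomes one final cluster (union of its document sets).
    """
    n = len(base_clusters)
    parent = list(range(n))

    def root_of(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if jaccard(base_clusters[i], base_clusters[j]) > threshold:
                parent[root_of(i)] = root_of(j)

    merged = {}
    for i, docs in enumerate(base_clusters):
        merged.setdefault(root_of(i), set()).update(docs)
    return sorted(sorted(cluster) for cluster in merged.values())


base = [{1, 3}, {1, 2, 3}, {2}, {4, 5}]
print(merge_base_clusters(base))
```

Because merging is by connected components, a chain of pairwise-similar base clusters ends up in one cluster even if its two ends share no documents.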

The base cluster graph

Problem of STC
The STC algorithm sometimes generates large clusters of poor quality.
No quality measure like tf-idf.
No single-link, group-average or complete-link criterion.
Solution: map each node of the suffix tree to a unique dimension of an M-dimensional space, where M is the total number of nodes in the suffix tree excluding the root node.

The New Suffix Tree Similarity Measure
Each document d is represented as a feature vector of the weights of the M nodes.
df(n) = the number of different documents that traverse node n
tf(n, d) = the number of times document d traverses node n
Example: df(b) = 3, tf(b, 1) = 1

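The New Suffix Tree Similarity Measure
Node weights are computed with the tf-idf formula, documents are compared by cosine similarity, and clustering uses the GAHC algorithm (group-average agglomerative hierarchical clustering).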
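A minimal sketch of the weighting and similarity steps is shown below. The slide transcript does not preserve the tf-idf formula, so the variant used here, w(n, d) = (1 + log tf(n, d)) * log(1 + N / df(n)), is an assumption, as are the function names and the hand-written tf and df counts.

```python
import math


def tfidf_vector(tf, df, n_docs):
    """Map each traversed node to a tf-idf weight (assumed variant).

    tf: node -> traversal count for one document
    df: node -> number of documents traversing the node
    """
    return {
        node: (1 + math.log(count)) * math.log(1 + n_docs / df[node])
        for node, count in tf.items()
        if count > 0
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(node, 0.0) for node, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical node statistics for a three-document collection.
df = {"ate": 3, "cat ate": 2, "cheese": 2, "too": 2}
d1 = tfidf_vector({"ate": 1, "cat ate": 1, "cheese": 1}, df, n_docs=3)
d2 = tfidf_vector({"ate": 1, "cheese": 1, "too": 1}, df, n_docs=3)
print(round(cosine(d1, d2), 3))
```

Storing vectors as dictionaries keeps them sparse: a document only carries entries for the nodes it actually traverses, not for all M nodes.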

A Closer Look at the Suffix Tree Document Model
Efficiency analysis: naive construction of the suffix tree takes O(m^2) time; Ukkonen's paper provides an algorithm that builds a suffix tree in O(m).
Stopword or stopnode: words in the stoplist are ignored in the score s(B) of a base cluster; likewise a stopnode, a node with a high df, can be ignored.

Document Preparation
1. all posts of the same thread are combined into a single document
2. all non-word tokens are stripped
3. all stopwords are identified and removed
4. the Porter stemming algorithm is applied
5. the posts containing at least 3 distinct words are selected

Cluster Topic Summary Generation
Generating a topic summary involves two information retrieval tasks:
1. ranking the documents in a cluster by a quality score
2. extracting common phrases as the topic summary

Cluster Topic Summary Generation
Document quality evaluation: Web documents provide additional human assessments of document quality, namely view clicks, reply posts and recommend clicks.
The top 10% of documents are taken as the representatives of the cluster.
The nodes traversed by the representative documents are selected and sorted by their idf in ascending order. Finally, the top 5 nodes are selected.
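The selection procedure above can be sketched as follows. The quality scores are taken as given inputs because the slides do not say how view clicks, reply posts and recommend clicks are combined. The idf definition log(N / df), the rounding rule for "top 10%", the alphabetical tie-break and all names are assumptions made for illustration.

```python
import math


def topic_summary(quality, doc_nodes, df, n_docs, top_fraction=0.1, k=5):
    """Pick summary nodes for one cluster.

    quality:   doc id -> quality score (how it is computed is not specified)
    doc_nodes: doc id -> iterable of nodes (phrases) the document traverses
    df:        node -> number of documents traversing the node
    """
    # At least one representative, otherwise the top fraction rounded up.
    n_reps = max(1, math.ceil(top_fraction * len(quality)))
    reps = sorted(quality, key=quality.get, reverse=True)[:n_reps]

    candidates = set()
    for doc_id in reps:
        candidates.update(doc_nodes[doc_id])

    def idf(node):
        return math.log(n_docs / df[node])

    # Ascending idf, ties broken alphabetically for a stable result.
    return sorted(candidates, key=lambda node: (idf(node), node))[:k]


quality = {"d1": 9.0, "d2": 3.0, "d3": 1.0}
doc_nodes = {
    "d1": ["ate", "cat ate", "cheese"],
    "d2": ["ate", "cheese", "too"],
    "d3": ["ate", "cat ate", "too"],
}
df = {"ate": 3, "cat ate": 2, "cheese": 2, "too": 2}
print(topic_summary(quality, doc_nodes, df, n_docs=3, k=2))
```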

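EVALUATION
Clusters generated by the system: C = {C1, C2, …, Ck}
Ground-truth (answer) clusters: C* = {C*1, C*2, …, C*l}
Recall(i, j) = |Cj ∩ C*i| / |C*i|
Precision(i, j) = |Cj ∩ C*i| / |Cj|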
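As a worked illustration of the evaluation, the sketch below computes precision and recall for one generated cluster against one ground-truth cluster, using the standard set-overlap definitions. The F-measure line is an addition not shown on the slide, and the function name and sample sets are hypothetical.

```python
def precision_recall_f(cluster, truth):
    """Precision, recall and F-measure of one cluster vs. one true class."""
    cluster, truth = set(cluster), set(truth)
    overlap = len(cluster & truth)
    precision = overlap / len(cluster) if cluster else 0.0
    recall = overlap / len(truth) if truth else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical document ids: 2 of the 4 clustered documents are correct,
# and 2 of the 3 true members were found.
p, r, f = precision_recall_f({1, 2, 3, 4}, {1, 2, 5})
print(p, round(r, 3), round(f, 3))
```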

Document Collections
OHSUMED Document Collection: 8 categories, 800 documents, containing 6,281 distinct words. The average length of the documents is about 110 words.
RCV1 Document Collection: 10 groups of documents, containing 19,229 distinct words. The average length of the documents is about 150 words.

Results and Discussion

Results and Discussion
STC algorithm: there is no effective measure for evaluating the quality of the clusters during cluster merging. As a result, the STC algorithm seldom generated large clusters of high quality in the experiments.

Results and Discussion DS3 document

CONCLUSIONS AND FUTURE WORK
By mapping all nodes of the common suffix tree into an M-dimensional space of the VSD model, the advantages of both the VSD model and the suffix tree model are inherited.
The suffix tree similarity measure is very simple, but its implementation is quite difficult.
Future work: time efficiency and space efficiency; applying the new similarity measure to Chinese documents.