A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng WWW 07.

Presentation transcript:


1. Document Clustering
Agglomerative Hierarchical Clustering (AHC)
Suffix Tree Clustering (STC): commonly used for clustering search results

2-1. Suffix Tree Clustering
Example with 3 documents:
  d1: cat ate cheese
  d2: cat ate mouse too
  d3: mouse ate cheese too

cat ate cheese
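The base clusters for the three example documents can be sketched by brute force. This is only an illustration: it enumerates every contiguous word sequence instead of building a linear-time suffix tree, so a phrase such as "cat" shows up separately even though a compacted suffix tree would fold it into the node "cat ate". The function name `base_clusters` and the `min_docs` parameter are hypothetical.

```python
from collections import defaultdict

def base_clusters(docs, min_docs=2):
    """Map each word phrase to the set of document indices that contain it.

    Brute-force stand-in for a word-level suffix tree (assumption: fine for
    tiny inputs, quadratic in document length).
    """
    phrase_docs = defaultdict(set)
    for idx, text in enumerate(docs):
        words = text.split()
        # Every contiguous word sequence is a prefix of some suffix,
        # i.e. a path starting at the root of the suffix tree.
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                phrase_docs[" ".join(words[start:end])].add(idx)
    # A base cluster is a phrase shared by at least min_docs documents.
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}

docs = ["cat ate cheese", "cat ate mouse too", "mouse ate cheese too"]
clusters = base_clusters(docs)
for phrase in sorted(clusters):
    print(phrase, sorted(clusters[phrase]))
```

Each printed phrase corresponds to a candidate base cluster B with its document set; the phrase itself plays the role of P in the scoring step.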

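2-2. Base Cluster
score(B) = |B| · f(|P|)
  |B|: number of documents in base cluster B
  |P|: number of words in phrase P, not counting stopwords
  Stopwords: stoplist words, plus words that appear in <= 3 documents or in > 40% of the documents
  f: penalizes single-word phrases and is constant for longer phrases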
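A minimal sketch of the base cluster score, under stated assumptions. The slide does not give the exact shape of f or the length at which it becomes constant, so `cap` is a required, hypothetical parameter, the `single_word_penalty` of 0.5 is an arbitrary illustrative value, and f is assumed to grow linearly between the two. Only the <= 3 documents and > 40% rules come from the slide.

```python
def effective_length(phrase, doc_freq, n_docs, stoplist=frozenset()):
    """Count the words of a phrase that are not treated as stopwords."""
    count = 0
    for word in phrase.split():
        df = doc_freq.get(word, 0)
        too_rare = df <= 3                # appears in 3 documents or fewer
        too_common = df > 0.4 * n_docs    # appears in more than 40% of documents
        if word in stoplist or too_rare or too_common:
            continue
        count += 1
    return count

def phrase_factor(length, cap, single_word_penalty=0.5):
    """Assumed shape of f: penalized at 1, linear up to cap, constant after."""
    if length == 0:
        return 0.0
    if length == 1:
        return single_word_penalty
    return float(min(length, cap))

def score(cluster_docs, phrase, doc_freq, n_docs, cap, stoplist=frozenset()):
    """score(B) = |B| * f(|P|)."""
    length = effective_length(phrase, doc_freq, n_docs, stoplist)
    return len(cluster_docs) * phrase_factor(length, cap)

# Hypothetical document frequencies in a 100-document collection.
doc_freq = {"the": 90, "suffix": 10, "tree": 12, "rare": 2}
cluster = {1, 2, 3, 4, 5}
s_phrase = score(cluster, "the suffix tree", doc_freq, n_docs=100, cap=6)
s_single = score(cluster, "suffix", doc_freq, n_docs=100, cap=6)
s_rare = score(cluster, "rare", doc_freq, n_docs=100, cap=6)
```

Here "the" is dropped as too common and "rare" as too rare, so the three-word phrase counts as two effective words and the single rare word scores zero.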

2-3. Combining Base Clusters
Keep the top k (= 500) base clusters
Merge base clusters with high overlap:
  merge B_i and B_j iff |B_i ∩ B_j| / |B_i| > 0.5 and |B_i ∩ B_j| / |B_j| > 0.5
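The merge rule can be sketched as connected components over a graph whose nodes are base clusters and whose edges join pairs that pass both overlap tests. This assumes merging is transitive (two base clusters end up together if a chain of overlapping clusters links them); the function name and the union-find helper are hypothetical, and the top-k selection step is left out.

```python
def merge_base_clusters(base, threshold=0.5):
    """Merge base clusters whose mutual overlap exceeds the threshold.

    base maps a phrase to a non-empty set of document ids.
    Returns the merged document sets as sorted lists.
    """
    names = list(base)
    parent = {n: n for n in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            common = len(base[a] & base[b])
            # Both ratios must be strictly greater than the threshold.
            if common / len(base[a]) > threshold and common / len(base[b]) > threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).update(base[n])
    return sorted(sorted(g) for g in groups.values())

# Hypothetical base clusters: phrase -> document ids.
base = {"a": {0, 1, 2}, "b": {1, 2, 3}, "c": {7, 8}, "d": {8, 9, 10, 11}}
merged = merge_base_clusters(base)
print(merged)
```

In this example "a" and "b" share two of three documents each and are merged, while "c" and "d" share one document, giving a ratio of exactly 0.5 for "c", which does not pass the strict inequality.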

2-4. Advantages
High precision even when using only snippets
Incremental and linear time
Order independent
No need to choose the number of clusters k in advance
Open questions: why the top k base clusters? Why the 0.5 threshold?

3. New Suffix Tree Clustering (NSTC)
Each document d_i is a vector with one tf-idf weight per suffix tree node n_k:
  d_i^T = [tfidf(n_1, d_i), tfidf(n_2, d_i), …]
Clustering: group-average AHC (GAHC)
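A rough sketch of the document vectors, assuming each document is summarized by how often it passes through each suffix tree node. The slide does not spell out the tf-idf formula, so the weighting (1 + log tf) · log(1 + N/df) used here is an assumption, as are the function names and the cosine similarity used to compare two vectors.

```python
import math

def tfidf_vectors(node_counts):
    """node_counts[i][n] = how many times document i traverses node n."""
    n_docs = len(node_counts)
    nodes = sorted({n for counts in node_counts for n in counts})
    # Document frequency of a node: number of documents that traverse it.
    df = {n: sum(1 for counts in node_counts if counts.get(n, 0) > 0) for n in nodes}
    vectors = []
    for counts in node_counts:
        vec = []
        for n in nodes:
            tf = counts.get(n, 0)
            # Assumed weighting; zero when the document never reaches the node.
            w = (1 + math.log(tf)) * math.log(1 + n_docs / df[n]) if tf > 0 else 0.0
            vec.append(w)
        vectors.append(vec)
    return nodes, vectors

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Node traversal counts for the three "cat ate cheese" example documents.
node_counts = [
    {"cat ate": 1, "ate": 1, "cheese": 1, "ate cheese": 1},
    {"cat ate": 1, "ate": 1, "mouse": 1, "too": 1},
    {"ate": 1, "cheese": 1, "mouse": 1, "too": 1, "ate cheese": 1},
]
nodes, vecs = tfidf_vectors(node_counts)
sim_01 = cosine(vecs[0], vecs[1])
```

The resulting pairwise similarities are the kind of input a group-average AHC step would consume.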

4. Evaluation
Use the F-measure, computed for cluster C_i against ground-truth class G_j:
  precision(C_i, G_j) = |C_i ∩ G_j| / |C_i|
  recall(C_i, G_j) = |C_i ∩ G_j| / |G_j|
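The precision and recall definitions can be turned into a small evaluation sketch. The slide only defines precision and recall; combining them as F = 2PR / (P + R) and averaging the best F per class weighted by class size is the usual convention for clustering evaluation and is an assumption here, as are the function names.

```python
def f_measure(cluster, klass):
    """Harmonic mean of precision and recall for one cluster/class pair."""
    common = len(cluster & klass)
    if common == 0:
        return 0.0
    precision = common / len(cluster)
    recall = common / len(klass)
    return 2 * precision * recall / (precision + recall)

def overall_f(clusters, classes):
    """Weight each class's best-matching cluster F by the class size."""
    total = sum(len(g) for g in classes)
    return sum(
        len(g) / total * max(f_measure(c, g) for c in clusters)
        for g in classes
    )

# Hypothetical clustering of six documents against two ground-truth classes.
clusters = [{1, 2, 3}, {4, 5, 6}]
classes = [{1, 2, 3, 4}, {5, 6}]
best_first = f_measure(clusters[0], classes[0])
result = overall_f(clusters, classes)
```

For the first pair, precision is 3/3 and recall is 3/4, so F is 6/7; the second class is best matched by the second cluster with F = 0.8.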

OHSUMED document collection: MeSH indexing terms
RCV1 document collection: categories

5. Comparison
STC: seldom generates large clusters
NSTC: not incremental