SIMILARITY MEASURES FOR TEXT DOCUMENT CLUSTERING
Anna Huang, Department of Computer Science, The University of Waikato, Hamilton, New Zealand. Presented by Farah Kamw.

1. INTRODUCTION
Clustering is a useful technique that organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters. A wide variety of distance functions and similarity measures have been used for clustering, such as squared Euclidean distance and cosine similarity. Text document clustering groups similar documents into coherent clusters, while documents that are different are separated into different clusters.

1. INTRODUCTION (CONT.)
In this paper, the authors compare and analyze the effectiveness of these measures in partitional clustering for text document datasets. Their experiments use the standard K-means algorithm, and they report results on seven text document datasets and five distance/similarity measures that have been most commonly used in text clustering. They use two measures to evaluate the overall quality of clustering solutions, purity and entropy, which are commonly used in clustering.

2. DOCUMENT REPRESENTATION
They use the frequency of each term as its weight, which means terms that appear more frequently are more important and descriptive for the document. Let D = {d_1, ..., d_n} be a set of documents and T = {t_1, ..., t_m} the set of distinct terms occurring in D. A document is then represented as an m-dimensional vector t_d. Let tf(d, t) denote the frequency of term t ∈ T in document d ∈ D. Then the vector representation of a document d is t_d = (tf(d, t_1), ..., tf(d, t_m)).
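As a concrete sketch of this term-frequency representation (the vocabulary and tokenized document below are illustrative, not from the paper's datasets):

```python
def tf_vector(document_tokens, term_set):
    """Represent a document as an m-dimensional vector of raw term frequencies tf(d, t)."""
    return [document_tokens.count(t) for t in term_set]

# Toy example: a three-term vocabulary and one short tokenized document.
terms = ["cluster", "text", "measure"]
doc = ["text", "cluster", "text"]
print(tf_vector(doc, terms))  # [1, 2, 0]
```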

2. DOCUMENT REPRESENTATION (CONT.)
Terms are basically words, but they applied several standard transformations to the basic term vector representation. First, stop words are removed. These are words that are non-descriptive of the topic of a document, such as a, and, are and do. They used the stop word list from the Weka machine learning workbench, which contains 527 stop words. Second, words were stemmed using Porter's suffix-stripping algorithm, so that words with different endings are mapped to a single stem. For example, production, produce, produces and product are all mapped to the stem produce.

2. DOCUMENT REPRESENTATION (CONT.)
Third, words that appear less often than a given threshold frequency were discarded. They then select the top 2000 words ranked by their weights and use them in their experiments. In addition, terms that appear frequently in a small number of documents but rarely in the other documents tend to be more important, and therefore more useful for finding similar documents. So, they transform the basic term frequencies tf(d, t) into tf-idf weights: tfidf(d, t) = tf(d, t) × log(|D| / df(t)), where df(t) is the number of documents in which term t appears.
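The standard tf-idf weighting tfidf(d, t) = tf(d, t) × log(|D| / df(t)) can be sketched as follows; the counts in the example are illustrative:

```python
import math

def tfidf(tf, df, n_docs):
    """tf-idf weight: term frequency scaled by the log of the inverse document frequency."""
    return tf * math.log(n_docs / df)

# A term appearing twice in a document, but in only 1 of 10 documents, gets a high weight.
print(tfidf(2, 1, 10))
```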

3. SIMILARITY MEASURES
In general, similarity/distance measures map the distance or similarity between the symbolic descriptions of two objects into a single numeric value, which depends on two factors: the properties of the two objects and the measure itself.

3.1 METRIC
To qualify as a metric, a measure d must satisfy the following four conditions. Let x, y and z be any objects in a set and d(x, y) be the distance between x and y.
1. d(x, y) >= 0.
2. d(x, y) = 0 if and only if x = y.
3. d(x, y) = d(y, x).
4. d(x, z) <= d(x, y) + d(y, z).

3.2 EUCLIDEAN DISTANCE
Euclidean distance is widely used in clustering problems, including text clustering. It satisfies all four conditions above and therefore is a true metric. Given two documents d_a and d_b represented by their term vectors t_a and t_b respectively, the Euclidean distance between the two documents is defined as:
D_E(t_a, t_b) = (Σ_{t=1}^{m} |w_{t,a} - w_{t,b}|^2)^{1/2},
where the term set is T = {t_1, ..., t_m} and w_{t,a} is the weight of term t in document d_a.
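A minimal sketch of this distance on term-weight vectors; the toy 3-dimensional vectors are illustrative:

```python
import math

def euclidean_distance(ta, tb):
    """D_E(t_a, t_b) = sqrt of the sum over terms of (w_ta - w_tb)^2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ta, tb)))

# Two toy term-weight vectors differing in a single dimension.
print(euclidean_distance([1.0, 0.0, 2.0], [1.0, 3.0, 2.0]))  # 3.0
```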

3.3 COSINE SIMILARITY
When documents are represented as term vectors, the similarity of two documents corresponds to the correlation between the vectors. This is quantified as the cosine of the angle between the vectors:
SIM_C(t_a, t_b) = (t_a · t_b) / (|t_a| × |t_b|),
where t_a and t_b are m-dimensional vectors over the term set T = {t_1, ..., t_m}. Each dimension represents a term with its weight in the document, which is non-negative. As a result, the cosine similarity is non-negative and bounded in [0, 1].
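A sketch of cosine similarity on term-weight vectors; the toy vectors are illustrative:

```python
import math

def cosine_similarity(ta, tb):
    """SIM_C = (t_a . t_b) / (|t_a| * |t_b|): 1 for parallel vectors, 0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(ta, tb))
    norm_a = math.sqrt(sum(a * a for a in ta))
    norm_b = math.sqrt(sum(b * b for b in tb))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors: similarity ~1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # no shared terms: similarity 0
```

Note that cosine similarity ignores vector magnitude, so two documents with the same term distribution but different lengths are considered identical.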

3.4 JACCARD COEFFICIENT
The Jaccard coefficient measures similarity as the intersection divided by the union of the objects. For term vectors, the formal definition is:
SIM_J(t_a, t_b) = (t_a · t_b) / (|t_a|^2 + |t_b|^2 - t_a · t_b).
The Jaccard coefficient is a similarity measure and ranges between 0 and 1, where 1 means the two objects are identical and 0 means they are completely different. The corresponding distance measure is D_J = 1 - SIM_J.
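A sketch of the extended (Tanimoto) form of the Jaccard coefficient for weighted term vectors, which reduces to the set-based intersection-over-union for binary vectors; the toy vectors are illustrative:

```python
def jaccard_similarity(ta, tb):
    """Extended Jaccard for weighted vectors: dot / (|t_a|^2 + |t_b|^2 - dot)."""
    dot = sum(a * b for a, b in zip(ta, tb))
    return dot / (sum(a * a for a in ta) + sum(b * b for b in tb) - dot)

print(jaccard_similarity([1.0, 2.0], [1.0, 2.0]))  # identical vectors -> 1.0
print(jaccard_similarity([1.0, 0.0], [0.0, 1.0]))  # no shared terms -> 0.0
```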

3.5 PEARSON CORRELATION COEFFICIENT
Given the term set T = {t_1, ..., t_m}, a commonly used form is:
SIM_P(t_a, t_b) = [m Σ_t w_{t,a} w_{t,b} - TF_a × TF_b] / sqrt([m Σ_t w_{t,a}^2 - TF_a^2] × [m Σ_t w_{t,b}^2 - TF_b^2]),
where TF_a = Σ_t w_{t,a} and TF_b = Σ_t w_{t,b}. This is also a similarity measure. However, unlike the other measures, it ranges from -1 to +1, and it equals 1 when t_a = t_b. The corresponding distance measure is D_P = 1 - SIM_P when SIM_P >= 0, and D_P = |SIM_P| when SIM_P < 0.
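A sketch using the mean-centered form of the correlation, which is algebraically equivalent to the expanded-sum form; the distance conversion follows the rule for negative correlations, and the toy vectors are illustrative:

```python
import math

def pearson_similarity(ta, tb):
    """Mean-centered Pearson correlation between two term-weight vectors."""
    m = len(ta)
    mean_a, mean_b = sum(ta) / m, sum(tb) / m
    num = sum((a - mean_a) * (b - mean_b) for a, b in zip(ta, tb))
    den = math.sqrt(sum((a - mean_a) ** 2 for a in ta) *
                    sum((b - mean_b) ** 2 for b in tb))
    return num / den

def pearson_distance(sim):
    """D_P = 1 - SIM_P when SIM_P >= 0, and |SIM_P| when SIM_P < 0."""
    return 1 - sim if sim >= 0 else abs(sim)

print(pearson_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # perfectly correlated
```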

3.6 AVERAGED KL DIVERGENCE
In information-theoretic clustering, a document is considered as a probability distribution over terms. The similarity of two documents is then measured as the distance between the two corresponding probability distributions. The Kullback-Leibler divergence (KL divergence) is a widely applied measure for evaluating the difference between two probability distributions. Because the KL divergence is not symmetric, an averaged form is used for clustering.
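A minimal sketch of an averaged KL divergence between two term distributions. The equal weights (1/2 each) and the mean distribution M used here are one common symmetrization, in the style of the Jensen-Shannon divergence; the paper's exact per-term weighting may differ, so treat this as an assumption:

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum over terms of p_t * log(p_t / q_t), skipping zero-probability terms of P."""
    return sum(pt * math.log(pt / qt) for pt, qt in zip(p, q) if pt > 0)

def averaged_kl(p, q):
    """Symmetrized KL: average each distribution's divergence from their mean M."""
    m = [(pt + qt) / 2 for pt, qt in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

print(averaged_kl([0.5, 0.5], [0.5, 0.5]))  # identical distributions -> 0.0
```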

4. K-MEANS CLUSTERING ALGORITHM
K-means is an iterative partitional clustering process that aims to minimize the least squares error criterion. The algorithm works as follows:
1. Given a set of data objects D and a pre-specified number of clusters k,
2. k data objects are randomly selected to initialize the k clusters, each being the centroid of a cluster.
3. The remaining objects are then assigned to the cluster represented by the nearest or most similar centroid.
4. New centroids are re-computed for each cluster, and in turn all documents are re-assigned based on the new centroids. This step iterates until all data objects remain in the same cluster after an update of centroids.
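The steps above can be sketched as follows; for brevity, the sketch clusters 1-D points under absolute distance rather than document vectors, and the data and seed are illustrative:

```python
import random

def kmeans_1d(points, k, max_iters=100, seed=0):
    """Minimal 1-D k-means: random init, assign to nearest centroid, recompute, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 2: k random objects as initial centroids
    for _ in range(max_iters):
        # Step 3: assign every point to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: abs(p - centroids[c]))].append(p)
        # Step 4: recompute centroids; stop when no centroid moves.
        new = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids

print(sorted(kmeans_1d([1, 2, 10, 11], 2)))  # two well-separated groups -> [1.5, 10.5]
```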

5. EXPERIMENT
5.1 Datasets

5.2 Evaluation
The quality of a clustering result was evaluated using two evaluation measures, purity and entropy. The purity measure evaluates the coherence of a cluster, that is, the degree to which a cluster contains documents from a single category. Given a particular cluster C_i of size n_i, the purity of C_i is formally defined as:
P(C_i) = (1 / n_i) max_h (n_i^h),
where n_i^h is the number of documents in C_i from category h, so max_h (n_i^h) counts the documents from the dominant category. In general, the higher the purity value, the better the quality of the cluster.
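A sketch of purity for a single cluster, given the true category label of each document in it; the labels are illustrative:

```python
from collections import Counter

def purity(cluster_labels):
    """P(C_i) = (1 / n_i) * max_h n_i^h: fraction of the cluster in its dominant category."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)

# A 3-document cluster in which 2 documents share the dominant category.
print(purity(["sports", "sports", "politics"]))
```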

5.2 Evaluation (cont.)
The entropy measure evaluates the distribution of categories in a given cluster. The entropy of a cluster C_i with size n_i is defined as:
E(C_i) = -(1 / log c) Σ_{h=1}^{c} (n_i^h / n_i) log(n_i^h / n_i),
where c is the total number of categories in the data set and n_i^h is the number of documents from the h-th category in cluster C_i. For an ideal cluster with documents from only a single category, the entropy of the cluster will be 0. In general, the smaller the entropy value, the better the quality of the cluster.
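A sketch of this normalized cluster entropy, again given the true category label of each document; the labels are illustrative:

```python
import math
from collections import Counter

def cluster_entropy(cluster_labels, num_categories):
    """E(C_i) = -(1/log c) * sum_h (n_i^h/n_i) * log(n_i^h/n_i), normalized to [0, 1]."""
    n = len(cluster_labels)
    # Summing (cnt/n) * log(n/cnt) flips the sign inside the log, avoiding the leading minus.
    h = sum((cnt / n) * math.log(n / cnt) for cnt in Counter(cluster_labels).values())
    return h / math.log(num_categories)

print(cluster_entropy(["sports", "sports", "sports"], 2))  # single category -> 0.0
print(cluster_entropy(["sports", "politics"], 2))          # evenly mixed -> 1.0
```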

Thank You