
Clustering

What is clustering? Grouping similar objects together and keeping dissimilar objects apart. In Information Retrieval, the cluster hypothesis is that if one document in a cluster is relevant, all the documents in that cluster will probably be relevant.

Similarity / Distance measures
Cosine similarity measure
Euclidean distance = sqrt( (q1-d1)² + (q2-d2)² + … + (qn-dn)² )
Simple matching coefficient = number of features in common
Manhattan distance = |q1-d1| + |q2-d2| + … + |qn-dn|
Dice’s similarity measure = 2 * number of matches / (number of features in a + number of features in b)
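As an illustration, here is a minimal Python sketch of these measures for two equal-length feature vectors. The function names and the use of plain lists are my own choices for the example, not part of the slides.

import math

def cosine_similarity(q, d):
    # dot product divided by the product of the vector lengths
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

def euclidean_distance(q, d):
    return math.sqrt(sum((qi - di) ** 2 for qi, di in zip(q, d)))

def manhattan_distance(q, d):
    return sum(abs(qi - di) for qi, di in zip(q, d))

def simple_matching(q, d):
    # positions where both binary vectors have the feature
    # (following the slide's "number of features in common")
    return sum(1 for qi, di in zip(q, d) if qi == di == 1)

def dice_similarity(q, d):
    # 2 * matches / (features in q + features in d), for binary vectors
    matches = simple_matching(q, d)
    return 2 * matches / (sum(q) + sum(d))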

Non-hierarchic clustering The data is partitioned into clusters of similar objects with no hierarchic relationship between the clusters. Clusters can be represented by their centroid, which is the “average” of all the cluster members, sometimes called a class exemplar. Each object’s closeness to a cluster centroid is computed with one of the similarity or distance measures above.

User-defined parameters
The number of clusters desired (may arise automatically as part of the clustering procedure).
The minimum and maximum size for each cluster.
The vigilance parameter: a threshold value on the similarity measure, below which an object will not be included in a cluster.
Control of the degree of overlap between clusters.
Non-hierarchical algorithms can be transformed into hierarchical algorithms by using the clusters obtained at one level as the objects to be classified at the next level, thus producing a hierarchy of clusters.

Single pass algorithm (one version)
The objects to be clustered are processed one by one.
The first object description becomes the centroid of the first cluster.
Each subsequent object is matched against all cluster centroids existing at its processing time.
The object is assigned to one cluster (or more if overlap is allowed) according to some condition on the similarity measure.
If an object fails to match any existing cluster sufficiently closely, it becomes the exemplar of a new cluster.
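A minimal Python sketch of this single pass procedure, using Manhattan distance and a distance threshold as the vigilance parameter; the function and variable names are illustrative, not from the slides.

def single_pass_cluster(patterns, vigilance):
    # Single pass clustering: assign each pattern to the nearest existing
    # centroid if it is within the vigilance threshold, otherwise start
    # a new cluster with that pattern as its exemplar.
    clusters = []   # each cluster is a list of member patterns
    centroids = []  # running centroid (mean) of each cluster

    def manhattan(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    def mean(members):
        return [sum(col) / len(members) for col in zip(*members)]

    for p in patterns:
        if centroids:
            distances = [manhattan(p, c) for c in centroids]
            best = min(range(len(centroids)), key=lambda i: distances[i])
            if distances[best] <= vigilance:
                clusters[best].append(p)
                centroids[best] = mean(clusters[best])
                continue
        # no existing cluster is close enough: p becomes a new exemplar
        clusters.append([p])
        centroids.append(list(p))
    return clusters, centroids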

Single pass algorithm (example)
Set vigilance parameter (VP) to 2, here a threshold on Manhattan distance.
Pattern 1 = [4 0 2] automatically goes into the first cluster, which will have centroid [4 0 2].
Pattern 2 = [4 0 1] is sufficiently close to the first cluster to join it as well, since the Manhattan distance from pattern 2 to the centroid of cluster 1 <= VP. The centroid of cluster 1 is updated to [4 0 1.5].
Pattern 3 = [0 5 0] forms its own new cluster, since it is too far away from the first cluster (Manhattan distance = 10.5, VP = 2). The new (second) cluster starts with centroid [0 5 0].
Pattern 4 = [1 4 0]. Manhattan distance from centroid 1 = 8.5, Manhattan distance from centroid 2 = 2. So pattern 4 goes into cluster 2, which now has the centroid [0.5 4.5 0].
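Running the sketch above on the example patterns (assuming the single_pass_cluster function defined earlier) reproduces the two clusters and centroids:

patterns = [[4, 0, 2], [4, 0, 1], [0, 5, 0], [1, 4, 0]]
clusters, centroids = single_pass_cluster(patterns, vigilance=2)
print(clusters)   # [[[4, 0, 2], [4, 0, 1]], [[0, 5, 0], [1, 4, 0]]]
print(centroids)  # [[4.0, 0.0, 1.5], [0.5, 4.5, 0.0]]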

Two pass algorithm (MacQueen’s k-means method)
Take the first k objects in the data set as clusters of one member each (seed points).
Assign each of the remaining m-k objects to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
After all objects have been assigned, take the existing cluster centroids as seed points and make one more pass through the data set, assigning each object to the nearest seed point.
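A minimal sketch of this two pass procedure in Python, using Euclidean distance; the names and structure are my own and are intended only to mirror the steps above.

def macqueen_kmeans(objects, k):
    # Two-pass k-means (MacQueen): the first pass updates the centroid of the
    # gaining cluster after every assignment; the second pass reassigns every
    # object to the fixed centroids obtained at the end of the first pass.
    def euclidean(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def mean(members):
        return [sum(col) / len(members) for col in zip(*members)]

    # pass 1: the first k objects are one-member clusters (seed points)
    clusters = [[obj] for obj in objects[:k]]
    centroids = [list(obj) for obj in objects[:k]]
    for obj in objects[k:]:
        nearest = min(range(k), key=lambda i: euclidean(obj, centroids[i]))
        clusters[nearest].append(obj)
        centroids[nearest] = mean(clusters[nearest])

    # pass 2: reassign every object to the nearest of the final centroids
    final_clusters = [[] for _ in range(k)]
    for obj in objects:
        nearest = min(range(k), key=lambda i: euclidean(obj, centroids[i]))
        final_clusters[nearest].append(obj)
    return final_clusters, centroids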

Hierarchic clustering methods
Hierarchical document clustering methods produce tree-like categorisations (dendrograms) where small clusters of highly similar documents are included within much larger clusters of less similar documents.
The individual objects (e.g. documents) are represented by the leaves of the tree, while the root of the tree represents the fact that all the objects ultimately combine into a single cluster.
May be agglomerative (inside out, bottom up) or divisive (outside in, top down).

Divisive clustering We start with a single cluster containing all the documents, and sequentially subdivide it until we are left with the individual documents. Divisive methods tend to produce monothetic categorisations, where all the documents in a cluster must share certain index terms.

Outside In Clustering (three illustration slides; diagrams not transcribed)

Agglomerative clustering More common than divisive clustering, especially in information retrieval. Agglomerative methods tend to produce polythetic categorisations, which are more useful in document retrieval. In a polythetic categorisation, a document is placed in the cluster with which it has the greatest number of index terms in common, but there is no single index term which is a prerequisite for cluster membership.

Types of hierarchical agglomerative clustering techniques
Single linkage (nearest neighbour)
Average linkage
Complete linkage (furthest neighbour)
All these methods start from a matrix containing the similarity value between every pair of documents in the collection.
The following algorithm covers all 3 methods:

General hierarchical agglomerative clustering technique:
For every document pair find SIM[i,j], the entry in the similarity matrix, then repeat the following:
– Search the similarity matrix to identify the most similar remaining pair of clusters;
– Fuse this pair K and L to form a new cluster KL;
– Update SIM by calculating the similarity between the new cluster and each of the remaining clusters;
Until there is only one cluster left.
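The following Python sketch follows this algorithm on a precomputed similarity matrix, with the linkage rule passed in as a parameter (the rules themselves are described on the next slide). As a simplification it recomputes cluster similarities from the original document similarities rather than maintaining an updated SIM matrix; the function name and the list-of-lists matrix are assumptions for illustration.

def agglomerative_cluster(sim, linkage="single"):
    # Naive hierarchical agglomerative clustering over a symmetric similarity
    # matrix (list of lists). Returns the merge history as
    # (members_of_K, members_of_L, similarity) tuples.
    clusters = {i: {i} for i in range(len(sim))}

    def cluster_sim(a, b):
        # similarity between two clusters from the document-level values
        pair_sims = [sim[i][j] for i in clusters[a] for j in clusters[b]]
        if linkage == "single":      # most similar pair of documents
            return max(pair_sims)
        if linkage == "complete":    # least similar pair of documents
            return min(pair_sims)
        return sum(pair_sims) / len(pair_sims)   # "average" linkage

    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        # search for the most similar remaining pair of clusters
        a, b = max(((x, y) for i, x in enumerate(keys) for y in keys[i + 1:]),
                   key=lambda pair: cluster_sim(*pair))
        merges.append((set(clusters[a]), set(clusters[b]), cluster_sim(a, b)))
        clusters[a] = clusters[a] | clusters[b]   # fuse the pair K and L into KL
        del clusters[b]
    return merges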

Differences between single, average and complete linkage methods
The methods differ in how the similarity matrix is updated after each fusion.
Average linkage – when two items are fused, the similarity matrix is updated by averaging their similarities to every other document.
Single linkage – the similarity between two clusters is based on the most similar pair of documents.
Complete linkage – the similarity between two clusters is based on the least similar pair of documents.
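With the agglomerative_cluster sketch above (a hypothetical helper, not from the slides), the linkage choice is simply a parameter; the small similarity matrix below is invented for illustration.

# toy symmetric similarity matrix for four documents
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
for linkage in ("single", "average", "complete"):
    print(linkage, agglomerative_cluster(sim, linkage=linkage))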

The validity of document clustering
Danger: clustering methods will find patterns even in random data (think of the constellations).
In general, methods which result in little modification of the original similarity data are better than those which distort the inter-object similarity data.
The most common distortion measure is the cophenetic correlation coefficient, produced by comparing the values in the original similarity matrix with the inter-object similarities found in the resulting dendrogram.
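As an illustration (assuming SciPy is available), the cophenetic correlation coefficient of a hierarchical clustering can be computed as follows; the toy document vectors are invented for the example.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cophenet

# toy document vectors (invented for illustration)
docs = np.array([[4, 0, 2], [4, 0, 1], [0, 5, 0], [1, 4, 0]], dtype=float)

original_dists = pdist(docs, metric="cityblock")  # condensed distance matrix
Z = linkage(original_dists, method="average")     # agglomerative clustering
c, coph_dists = cophenet(Z, original_dists)       # compare dendrogram to original distances
print("cophenetic correlation coefficient:", c)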