Clustering… in General In vector space, clusters are vectors found within ε of a cluster vector, with different techniques for determining the cluster vector and ε. Clustering is unsupervised pattern classification. Unsupervised means there is no correct answer or feedback. Patterns typically are samples of feature vectors or matrices. Classification means collecting the samples into groups of similar members.

Clustering Decisions
- Pattern representation: feature selection (e.g., stop word removal, stemming); number of categories
- Pattern proximity: distance measure on pairs of patterns
- Grouping: characteristics of clusters (e.g., fuzzy, hierarchical)
Clustering algorithms embody different assumptions about these decisions and about the form of clusters.

Formal Definitions
- Feature vector x is a single datum of d measurements.
- Hard clustering techniques assign a class label to each cluster; cluster memberships are mutually exclusive.
- Fuzzy clustering techniques assign, for each x, a fractional degree of membership to each label.

Proximity Measures Generally, use Euclidean distance or mean squared distance. In IR, use similarity measure from retrieval (e.g., cosine measure for TFIDF).
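To make the two proximity styles concrete, here is a minimal sketch in Python, assuming dense numpy vectors; the function names and toy TFIDF values are illustrative, not part of the slides.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine measure used in IR: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance, the usual choice outside IR."""
    return float(np.linalg.norm(a - b))

doc1 = np.array([0.0, 1.2, 0.5, 0.0])  # toy TFIDF vectors
doc2 = np.array([0.1, 0.9, 0.4, 0.0])
print(cosine_similarity(doc1, doc2), euclidean_distance(doc1, doc2))
```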

[Jain, Murty & Flynn] Taxonomy of Clustering Clustering HierarchicalPartitional Single Link Complete Link Square Error Graph Theoretic Mixture Resolving Mode Seeking k-means Expectation Minimization HAC

Clustering Issues
- Agglomerative: begin with each sample in its own cluster and merge. Divisive: begin with a single cluster and split.
- Hard: mutually exclusive cluster membership. Fuzzy: degrees of membership in clusters.
- Deterministic vs. stochastic.
- Incremental: samples may be added to clusters. Batch: clusters created over the entire sample space.

Hierarchical Algorithms Produce hierarchy of classes (taxonomy) from singleton clusters to just one cluster. Select level for extracting cluster set. Representation is a dendrogram.

Complete-Link Revisited Used to create a statistical thesaurus. Agglomerative, hard, deterministic, batch.
1. Start with one cluster per sample.
2. Find the two clusters with the lowest distance.
3. Merge the two clusters and add the merge to the hierarchy.
4. Repeat from 2 until a termination criterion is met or all clusters have merged.

Single-Link Like complete-link, except the cluster distance is the minimum of the distances between all pairs of samples in the two clusters (complete-link uses the maximum). Single-link suffers a chaining effect that produces elongated clusters, but it can recover more complex cluster shapes.
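To compare the two linkage rules directly, here is a small sketch using SciPy's hierarchical clustering on the eight points from the proximity-matrix example below; using SciPy is my assumption, not something the slides prescribe.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[21, 15], [26, 25], [29, 22], [31, 15],
              [21, 27], [23, 32], [29, 26], [33, 21]])

# Build the merge hierarchy under each inter-cluster distance rule.
Z_single = linkage(X, method='single')      # min over all cross-cluster pairs
Z_complete = linkage(X, method='complete')  # max over all cross-cluster pairs

# Cut each dendrogram to extract, say, 3 clusters.
print(fcluster(Z_single, t=3, criterion='maxclust'))
print(fcluster(Z_complete, t=3, criterion='maxclust'))
```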

Example: Plot [figure: scatter plot of the sample points used in the following slides]

Example: Proximity Matrix [table of pairwise distances between the points (21,15), (26,25), (29,22), (31,15), (21,27), (23,32), (29,26), (33,21); the individual entries did not survive extraction, but the diagonal is 0]
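Since the matrix entries were lost, they can be regenerated from the listed points; a minimal sketch, assuming Euclidean distance as on the Proximity Measures slide.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[21, 15], [26, 25], [29, 22], [31, 15],
                   [21, 27], [23, 32], [29, 26], [33, 21]])

# Full symmetric matrix of pairwise Euclidean distances; the diagonal is 0.
D = squareform(pdist(points, metric='euclidean'))
print(np.round(D, 1))
```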

Complete-Link Solution [figure: dendrogram over the sixteen sample points (1,28), (4,9), (9,16), (13,18), (21,15), (29,22), (31,15), (33,21), (35,35), (42,45), (45,42), (46,30), (23,32), (21,27), (29,26), (26,25), showing the successive merges C1 through C15]

Single-Link Solution [figure: dendrogram over the same sixteen points; the single-link merge order C1 through C15 differs from the complete-link solution]

Hierarchical Agglomerative Clustering (HAC) Agglomerative, hard, deterministic, batch.
1. Start with one cluster per sample and compute a proximity matrix between pairs of clusters.
2. Merge the most similar pair of clusters and update the proximity matrix.
3. Repeat 2 until all clusters have merged.
Variants differ in how the proximity matrix is updated, which makes it possible to combine the benefits of the single-link and complete-link algorithms.

HAC for IR Intra-cluster similarity: $\mathrm{Sim}(X) = \sum_{d \in X} \cos(d, c)$, where X is a cluster drawn from S, the set of TFIDF vectors for the documents; c is the centroid of cluster X; and d is a document. Proximity is the similarity of all documents to the cluster centroid. Select the pair of clusters that produces the smallest decrease in similarity, i.e., if merge(X,Y) => Z, then choose max[Sim(Z) − (Sim(X) + Sim(Y))].
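A minimal sketch of this merge criterion, assuming each cluster is a matrix whose rows are TFIDF vectors; the function names are illustrative.

```python
import numpy as np

def intra_cluster_sim(X: np.ndarray) -> float:
    """Sim(X): summed cosine similarity of each row (document) to the centroid."""
    c = X.mean(axis=0)
    c = c / np.linalg.norm(c)
    docs = X / np.linalg.norm(X, axis=1, keepdims=True)
    return float(np.sum(docs @ c))

def merge_score(X: np.ndarray, Y: np.ndarray) -> float:
    """Sim(Z) - (Sim(X) + Sim(Y)) for Z = X merged with Y; merge the pair maximizing this."""
    Z = np.vstack([X, Y])
    return intra_cluster_sim(Z) - (intra_cluster_sim(X) + intra_cluster_sim(Y))
```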

HAC for IR – Alternatives Centroid similarity: the cosine similarity between the centroids of the two clusters. UPGMA (group average): the average of the pairwise similarities between all documents in the two clusters.

Partitional Algorithms Result in a set of unrelated (flat) clusters. Issues:
- How many clusters is enough?
- How to search the space of possible partitions?
- What is an appropriate clustering criterion?

K Means The number of clusters is set by the user to be K. Non-deterministic. The clustering criterion is squared error: $e^2(S, L) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \lVert x_i^{(j)} - c_j \rVert^2$, where S is the document set, L is a clustering, K is the number of clusters, $x_i^{(j)}$ is the i-th document in the j-th cluster, and $c_j$ is the centroid of the j-th cluster.

k-Means Clustering Algorithm
1. Randomly select k samples as the initial cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. Recompute the centroids.
4. If the convergence criterion (e.g., minimal decrease in squared error or no change in cluster composition) is not met, return to 2.
A minimal sketch of these steps follows.
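A plain numpy sketch of the four steps above, assuming real-valued sample vectors; this is an illustration, not the course's reference implementation.

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, max_iter: int = 100, seed: int = 0):
    """Plain k-means; returns (centroids, labels, squared error)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # step 1
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # step 4: converged, no change in cluster composition
        labels = new_labels
        # Step 3: recompute each centroid; keep the old one if its cluster empties.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    # Squared-error criterion from the previous slide, for reference.
    error = sum(np.sum((X[labels == j] - centroids[j]) ** 2) for j in range(k))
    return centroids, labels, error
```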

Example: K-Means Solutions [figure: alternative k-means clusterings of the sample points]

k-Means Sensitivity to Initialization [figure: seven points A–G clustered with K=3; the red solution was seeded with A, D, F and the yellow solution with A, B, C, producing different final clusterings]

k-Means for IR Update centroids incrementally. Calculate centroids as with the hierarchical methods. k-means can also be refined into a divisive hierarchical method: start with a single cluster and repeatedly split clusters using k-means until k clusters with the highest summed similarities are formed (bisecting k-means).
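A sketch of bisecting k-means, assuming scikit-learn is available. Note one simplification: this version always splits the largest cluster, whereas Steinbach et al. keep the trial bisection with the highest summed similarity.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X: np.ndarray, k: int):
    """Divisive sketch: repeatedly split one cluster in two until k clusters exist."""
    clusters = [X]
    while len(clusters) < k:
        # Simplified split-selection rule: take the largest remaining cluster.
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        victim = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(victim)
        clusters.extend([victim[labels == 0], victim[labels == 1]])
    return clusters
```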

Other Types of Clustering Algorithms
- Graph theoretic: construct the minimal spanning tree and delete the edges with the largest lengths.
- Expectation Maximization (EM): assume the clusters are drawn from distributions; use maximum likelihood to estimate the parameters of the distributions (see the sketch below).
- Nearest neighbors: iteratively assign each sample to the cluster of its nearest labelled neighbor, so long as the distance is below a set threshold.
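One way to realize mixture-resolving clustering is a Gaussian mixture fit by EM; a minimal sketch assuming scikit-learn, reusing the earlier toy points.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[21, 15], [26, 25], [29, 22], [31, 15],
              [21, 27], [23, 32], [29, 26], [33, 21]], dtype=float)

# Fit a 3-component Gaussian mixture by EM; predict gives hard labels,
# predict_proba gives the fuzzy (fractional) memberships.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gmm.predict(X))
print(np.round(gmm.predict_proba(X), 2))
```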

Comparison of Clustering Algorithms [Steinbach et al.] Implemented 3 versions of HAC and 2 versions of k-means. Compared performance on documents hand-labelled as relevant to one of a set of classes, using well-known data sets (TREC). Found that UPGMA is the best of the hierarchical methods, but bisecting k-means seems to do better when considered over many runs. M. Steinbach, G. Karypis, and V. Kumar. A Comparison of Document Clustering Techniques, KDD Workshop on Text Mining, 2000.

Evaluation Metrics 1 Evaluation: how to measure cluster quality? Entropy: for each cluster j, $E_j = -\sum_i p_{ij} \log p_{ij}$, and for the whole solution $\mathrm{Entropy}(CS) = \sum_{j=1}^{m} \frac{n_j}{n} E_j$, where $p_{ij}$ is the probability that a member of cluster j belongs to class i, $n_j$ is the size of cluster j, m is the number of clusters, n is the number of docs, and CS is a clustering solution.
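A minimal sketch of this entropy computation, assuming class and cluster assignments are given as parallel integer arrays; the function name is illustrative.

```python
import numpy as np

def clustering_entropy(classes: np.ndarray, clusters: np.ndarray) -> float:
    """Size-weighted entropy of a clustering solution (lower is better)."""
    n = len(classes)
    total = 0.0
    for j in np.unique(clusters):
        members = classes[clusters == j]
        _, counts = np.unique(members, return_counts=True)
        p = counts / len(members)
        e_j = -np.sum(p * np.log(p))       # entropy of cluster j
        total += (len(members) / n) * e_j  # weight by cluster size
    return total
```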

Evaluation Metrics 2 F measure: combines precision and recall. Treat each cluster as the result of a query and each class as the relevant set of docs: $P(i,j) = n_{ij}/n_j$, $R(i,j) = n_{ij}/n_i$, $F(i,j) = \frac{2\,P(i,j)\,R(i,j)}{P(i,j) + R(i,j)}$, and overall $F = \sum_i \frac{n_i}{n} \max_j F(i,j)$, where $n_{ij}$ is the number of members of class i in cluster j, $n_j$ is the number in cluster j, $n_i$ is the number in class i, and n is the number of docs.
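A companion sketch for the F measure, under the same assumptions as the entropy sketch above.

```python
import numpy as np

def clustering_f_measure(classes: np.ndarray, clusters: np.ndarray) -> float:
    """Overall F measure of a clustering against labelled classes (higher is better)."""
    n = len(classes)
    score = 0.0
    for i in np.unique(classes):
        n_i = np.sum(classes == i)
        best = 0.0
        for j in np.unique(clusters):
            n_j = np.sum(clusters == j)
            n_ij = np.sum((classes == i) & (clusters == j))
            if n_ij == 0:
                continue
            p, r = n_ij / n_j, n_ij / n_i       # precision and recall of cluster j for class i
            best = max(best, 2 * p * r / (p + r))
        score += (n_i / n) * best               # weight each class by its size
    return score

# Toy usage: three true classes, an imperfect three-cluster solution.
classes  = np.array([0, 0, 1, 1, 2, 2])
clusters = np.array([0, 0, 1, 2, 2, 2])
print(clustering_f_measure(classes, clusters))
```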