
E.G.M. Petrakis, Text Clustering

Slide 1: Clustering
- “Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters)” [ACM Computing Surveys, 1999]
- Instances within a cluster are very similar
- Instances in different clusters are very different

Slide 2: Example
[Figure: documents plotted in a two-dimensional term space (axes term1 and term2)]

Slide 3: Applications
- Faster retrieval
- Faster and better browsing
- Structuring of search results
- Revealing classes and other data regularities
- Directory construction
- Better data organization in general

Slide 4: Cluster Searching
- Similar instances tend to be relevant to the same requests
- The query is mapped to the closest cluster by comparing it with the cluster centroids

Slide 5: Notation
- N: number of elements
- Class: real-world grouping (ground truth)
- Cluster: grouping produced by the algorithm
- The ideal clustering algorithm produces clusters equivalent to the real-world classes, with exactly the same members

Slide 6: Problems
- How many clusters?
- Complexity? N is usually large
- Quality of clustering
- When is one method better than another?
- Overlapping clusters
- Sensitivity to outliers

Slide 7: Example
[Figure: example data set]

Slide 8: Clustering Approaches
- Divisive: build clusters “top down”, starting from the entire data set
  - K-means, Bisecting K-means
  - Hierarchical or flat clustering
- Agglomerative: build clusters “bottom up”, starting with individual instances and iteratively combining them into larger clusters at higher levels
  - Hierarchical clustering
- Combinations of the above
  - Buckshot algorithm

Slide 9: Hierarchical vs. Flat Clustering
- Flat: all clusters at the same level
  - K-means, Buckshot
- Hierarchical: a nested sequence of clusters
  - A single cluster with all the data at the top, singleton clusters at the bottom
  - Intermediate levels are more useful
  - Every intermediate level combines two clusters from the next lower level
  - Agglomerative, Bisecting K-means

Slide 10: Flat Clustering
[Figure: flat clustering example]

Slide 11: Hierarchical Clustering
[Figure: hierarchical clustering example]

Slide 12: Text Clustering
- Finds overall similarities among documents or groups of documents
- Enables faster searching, browsing, etc.
- Requires a way to compute the similarity (or, equivalently, the distance) between documents

Slide 13: Query-Document Similarity
- Similarity is defined as the cosine of the angle θ between the document and query vectors d1, d2:
  sim(d1, d2) = cos θ = (d1 · d2) / (|d1| |d2|)
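The cosine measure on this slide can be sketched in a few lines of Python (the function name is illustrative, not from the slides):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score 1, orthogonal vectors score 0.
same = cosine_similarity([1, 2], [2, 4])
orthogonal = cosine_similarity([1, 0], [0, 1])
```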

Slide 14: Document Distance
- Consider documents d1, d2 with vectors u1, u2
- Their distance is defined as the length of the segment AB joining the endpoints of the two vectors: dist(d1, d2) = |u1 − u2|
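For unit-length vectors the length of AB relates directly to the cosine measure: |AB|² = |u1|² + |u2|² − 2·u1·u2 = 2(1 − cos θ). A small numeric check (helper names are made up for illustration):

```python
import math

def euclidean(u, v):
    """Euclidean distance |u - v| between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def unit(u):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(a * a for a in u))
    return [a / n for a in u]

u1, u2 = unit([1.0, 2.0]), unit([3.0, 1.0])
cos_theta = sum(a * b for a, b in zip(u1, u2))
# For unit vectors: |AB|^2 = 2 * (1 - cos theta)
check = euclidean(u1, u2) ** 2 - 2 * (1 - cos_theta)
```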

Slide 15: Normalization by Document Length
- The longer a document is, the more likely a given term is to appear in it
- Normalize the term weights by document length, so that terms in long documents are not given more weight
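Length normalization amounts to dividing each weight by the vector's Euclidean length; a minimal sketch (the function name is illustrative):

```python
import math

def length_normalize(weights):
    """Scale a document's term weights to unit Euclidean length,
    so longer documents are not favoured in similarity scores."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else list(weights)

normalized = length_normalize([3.0, 4.0])  # a unit-length vector
```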

Slide 16: Evaluation of Cluster Quality
- Clusters can be evaluated using internal or external knowledge
- Internal measures: intra-cluster cohesion and cluster separability
  - Intra-cluster similarity
  - Inter-cluster similarity
- External measures: quality of clusters compared to the real classes
  - Entropy (E), Harmonic Mean (F)

Slide 17: Intra-Cluster Similarity
- A measure of cluster cohesion
- Defined as the average pairwise similarity of the documents in a cluster, which equals |c|², where c is the cluster centroid
- Documents (not centroids) have unit length
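This identity (average pairwise dot-product similarity, self-pairs included, equals the squared length of the centroid) makes cohesion cheap to compute. A sketch under that assumption:

```python
def intra_similarity(docs):
    """Average pairwise dot-product similarity of unit-length documents;
    equals the squared length of the cluster centroid c."""
    n, dim = len(docs), len(docs[0])
    c = [sum(d[i] for d in docs) / n for i in range(dim)]
    return sum(x * x for x in c)

# Identical unit vectors: cohesion 1; orthogonal unit vectors: cohesion 0.5.
tight = intra_similarity([[1.0, 0.0], [1.0, 0.0]])
loose = intra_similarity([[1.0, 0.0], [0.0, 1.0]])
```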

Slide 18: Inter-Cluster Similarity
a) Single link: the similarity of the two most similar members
b) Complete link: the similarity of the two least similar members
c) Group average: the average similarity between members
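The three linkage criteria can be written down directly over cosine similarity (helper names are illustrative):

```python
import math

def _cos(u, v):
    """Cosine similarity of two vectors."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def single_link(A, B):      # similarity of the most similar pair
    return max(_cos(a, b) for a in A for b in B)

def complete_link(A, B):    # similarity of the least similar pair
    return min(_cos(a, b) for a in A for b in B)

def group_average(A, B):    # mean similarity over all cross pairs
    sims = [_cos(a, b) for a in A for b in B]
    return sum(sims) / len(sims)

A = [[1.0, 0.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
```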

Slide 19: Example
[Figure: clusters S and S’ with centroids c and c’, showing the pairs compared by single link, complete link, and group average]

Slide 20: Entropy
- Measures the quality of flat clusters using external knowledge
  - A pre-existing classification
  - Assessment by experts
- P_ij: the probability that a member of cluster j belongs to class i
- The entropy of cluster j is defined as E_j = −Σ_i P_ij log P_ij

Slide 21: Entropy (cont’d)
- Total entropy over all clusters: E = Σ_{j=1..m} (n_j / N) E_j
  - n_j is the size of cluster j
  - m is the number of clusters
  - N is the number of instances
- The smaller E is, the better the quality of the clustering
- Note that the best (zero) entropy is obtained when each cluster contains exactly one instance
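The two formulas above translate directly into code. In this sketch a cluster is represented by its per-class member counts (function names are illustrative):

```python
import math

def cluster_entropy(class_counts):
    """E_j = -sum_i P_ij * log2(P_ij), given counts of each class in cluster j."""
    n = sum(class_counts)
    e = 0.0
    for c in class_counts:
        if c:
            p = c / n
            e -= p * math.log2(p)
    return e

def total_entropy(clusters):
    """Size-weighted average entropy: E = sum_j (n_j / N) * E_j."""
    N = sum(sum(c) for c in clusters)
    return sum(sum(c) / N * cluster_entropy(c) for c in clusters)

# One pure cluster (entropy 0) and one 50/50 cluster (entropy 1, base 2).
E = total_entropy([[4, 0], [2, 2]])
```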

Slide 22: Harmonic Mean (F)
- Treats each cluster as a query result
- F combines precision (P) and recall (R)
- F_ij for cluster j and class i is defined as F_ij = 2 P_ij R_ij / (P_ij + R_ij), with P_ij = n_ij / n_j and R_ij = n_ij / n_i
  - n_ij: number of instances of class i in cluster j
  - n_i: number of instances of class i
  - n_j: number of instances in cluster j

Slide 23: Harmonic Mean (cont’d)
- The F value of a class i is the maximum value it achieves over all clusters j: F_i = max_j F_ij
- The F value of a clustering solution is the weighted average over all classes: F = Σ_i (n_i / N) F_i
  - N is the number of data instances
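Putting the two slides together, the F value of a clustering can be computed from a class-by-cluster contingency table (the function name is illustrative):

```python
def f_measure(contingency):
    """F of a clustering; contingency[i][j] = count of class-i items in cluster j."""
    class_sizes = [sum(row) for row in contingency]            # n_i
    cluster_sizes = [sum(col) for col in zip(*contingency)]    # n_j
    N = sum(class_sizes)
    total = 0.0
    for i, row in enumerate(contingency):
        best = 0.0                       # F_i = max over clusters j
        for j, n_ij in enumerate(row):
            if n_ij:
                p = n_ij / cluster_sizes[j]   # precision
                r = n_ij / class_sizes[i]     # recall
                best = max(best, 2 * p * r / (p + r))
        total += class_sizes[i] / N * best    # weighted by class size
    return total

# Each class sits in its own cluster, so F = 1.
perfect = f_measure([[3, 0], [0, 5]])
```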

Slide 24: Quality of Clustering
- A good clustering method
  - Maximizes intra-cluster similarity
  - Minimizes inter-cluster similarity
  - Minimizes entropy
  - Maximizes the harmonic mean
- It is difficult to achieve all of these simultaneously
  - Maximize some objective function of the above
- An algorithm is better than another if it scores better on most of these measures

Slide 25: K-Means Algorithm
- Select K centroids
- Repeat I times, or until the centroids no longer change:
  - Assign each instance to the cluster of its nearest centroid
  - Recompute the centroids
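The loop above can be sketched as follows; the random initial centroids and the stopping rule match the slide, while the function signature is illustrative:

```python
import random

def kmeans(points, k, iters=10, seed=0):
    """Plain K-means: repeat up to `iters` times (or until the centroids
    settle): assign each point to its nearest centroid, then recompute."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # K random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        new = []
        for j, cl in enumerate(clusters):
            if cl:
                new.append([sum(xs) / len(cl) for xs in zip(*cl)])
            else:
                new.append(centroids[j])       # keep an empty cluster's centroid
        if new == centroids:                   # centroids did not change
            break
        centroids = new
    return clusters, centroids

pts = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
result, _ = kmeans(pts, k=2)
```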

Slides 26-32: K-Means demo
[Figures: seven steps of a K-Means run, from Nikos Hourdakis, MSc Thesis]

Slide 33: Comments on K-Means (1)
- Generates a flat partition of K clusters
- K, the desired number of clusters, must be known in advance
- Starts with K random cluster centroids
- A centroid is the mean or the median of a group of instances
- The mean rarely corresponds to a real instance

Slide 34: Comments on K-Means (2)
- Up to I = 10 iterations
- Keep the clustering with the best inter/intra-cluster similarity, or the final clusters after I iterations
- Complexity: O(IKN)
- Repeated application of K-Means for K = 2, 4, … can produce a hierarchical clustering

Slide 35: Choosing Centroids for K-Means
- The quality of the clustering depends on the selection of the initial centroids
- Random selection may result in a poor convergence rate, or in convergence to sub-optimal clusterings
- Select good initial centroids using a heuristic or the results of another method
  - Buckshot algorithm

Slide 36: Incremental K-Means
- Update each centroid immediately after each point is assigned to a cluster, rather than only at the end of the iteration
- Reassign instances to clusters at the end of each iteration
- Converges faster than simple K-Means
- Usually needs only 2-5 iterations

Slide 37: Bisecting K-Means
- Starts with a single cluster containing all instances
- Select a cluster to split: the largest cluster, or the one with the least intra-cluster similarity
- The selected cluster is split into two partitions using K-Means (K = 2)
- Repeat up to the desired depth h
- Produces a hierarchical clustering
- Complexity: O(2hN)
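A minimal sketch of the procedure, splitting the largest cluster each time (the "least intra-cluster similarity" criterion would work too; function names are illustrative):

```python
import random

def split_in_two(points, iters=10, seed=0):
    """K-means with K = 2: partition `points` into two groups."""
    rng = random.Random(seed)
    c1, c2 = rng.sample(points, 2)
    a, b = [], []
    for _ in range(iters):
        a, b = [], []
        for p in points:
            d1 = sum((x - y) ** 2 for x, y in zip(p, c1))
            d2 = sum((x - y) ** 2 for x, y in zip(p, c2))
            (a if d1 <= d2 else b).append(p)
        if a:
            c1 = [sum(xs) / len(a) for xs in zip(*a)]
        if b:
            c2 = [sum(xs) / len(b) for xs in zip(*b)]
    return a, b

def bisecting_kmeans(points, num_clusters):
    """Repeatedly split the largest cluster until enough clusters exist."""
    clusters = [list(points)]
    while len(clusters) < num_clusters:
        clusters.sort(key=len, reverse=True)
        a, b = split_in_two(clusters.pop(0))
        clusters += [a, b]
    return clusters

pts = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
split = bisecting_kmeans(pts, 2)
```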

Slide 38: Agglomerative Clustering
- Compute the similarity matrix between all pairs of instances
- Start from singleton clusters
- Repeat until a single cluster remains:
  - Merge the two most similar clusters, replacing them with a single cluster
  - Update the similarity matrix with the merged cluster
- Complexity: O(N²)
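A naive sketch of the merge loop with group-average linkage; note it recomputes similarities each round instead of updating a stored matrix, so it is slower than the O(N²) bookkeeping version the slide describes (names are illustrative):

```python
def agglomerative(items, sim, num_clusters=1):
    """Merge the two most similar clusters (group-average linkage)
    until `num_clusters` remain."""
    clusters = [[x] for x in items]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = (sum(sim(a, b) for a in clusters[i] for b in clusters[j])
                     / (len(clusters[i]) * len(clusters[j])))
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i] += clusters[j]   # merge the winning pair
        del clusters[j]
    return clusters

# 1-D toy data; similarity = negative distance.
merged = agglomerative([0.0, 1.0, 10.0, 11.0], lambda a, b: -abs(a - b), 2)
```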

Slide 39: Similarity Matrix

           C1=d1  C2=d2   …   CN=dN
  C1=d1      1      …     …    0.3
  C2=d2      …      1     …    0.6
  …          …      …     1     …
  CN=dN     0.3    0.6    …     1

Slide 40: Update Similarity Matrix

           C1=d1  C2=d2   …   CN=dN
  C1=d1      1      …     …    0.3
  C2=d2      …      1     …    0.6
  …          …      …     1     …
  CN=dN     0.3    0.6    …     1

(the rows and columns of C1 and C2 are merged)

Slide 41: New Similarity Matrix

              C12=d1∪d2   …   CN=dN
  C12=d1∪d2       1       …    0.4
  …               …       1     …
  CN=dN          0.4      …     1

Slide 42: Single Link
- Select the most similar clusters for merging using single link
- Can result in long, thin clusters due to the “chaining effect”
- Appropriate in some domains, such as clustering islands

Slide 43: Complete Link
- Select the most similar clusters for merging using complete link
- Results in compact, spherical clusters, which are preferable

Slide 44: Group Average
- Select the most similar clusters for merging using group average
- A fast compromise between single and complete link

Slide 45: Example
[Figure: clusters A and B with centroids c1 and c2, showing the pairs compared by single link, complete link, and group average]

Slide 46: Inter-Cluster Similarity
- A new (merged) cluster is represented by its centroid
- The document-to-cluster similarity is computed as the similarity between the document and the cluster centroid
- The cluster-to-cluster similarity can be computed as single link, complete link, or group average similarity

Slide 47: Buckshot K-Means
- Combines Agglomerative clustering and K-Means
- Agglomerative clustering produces a good solution but has O(N²) complexity
- Randomly select a sample of √N instances
- Apply Agglomerative clustering on the sample, which takes O(N) time
- Use the centroids of the resulting clusters as the initial centroids for K-Means
- Overall complexity: O(N)
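A compact sketch of the idea. It differs from the slide in minor ways that are assumptions of this example: the sample size is √(kN) (a common Buckshot choice), the agglomeration merges by centroid distance, and all names are made up:

```python
import math
import random

def _dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def _centroid(cluster):
    return [sum(xs) / len(cluster) for xs in zip(*cluster)]

def buckshot(points, k, seed=0):
    """Agglomerate a random sample down to k groups, then use the
    group centroids to seed K-means over the whole data set."""
    rng = random.Random(seed)
    size = max(k, int(math.sqrt(k * len(points))))
    sample = rng.sample(points, min(size, len(points)))
    clusters = [[p] for p in sample]
    while len(clusters) > k:                  # agglomerate the sample
        _, i, j = min((_dist2(_centroid(a), _centroid(b)), i, j)
                      for i, a in enumerate(clusters)
                      for j, b in enumerate(clusters) if i < j)
        clusters[i] += clusters[j]
        del clusters[j]
    seeds = [_centroid(c) for c in clusters]
    groups = [[] for _ in range(k)]
    for _ in range(10):                       # K-means refinement
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda g: _dist2(p, seeds[g]))].append(p)
        seeds = [_centroid(g) if g else seeds[j] for j, g in enumerate(groups)]
    return groups

pts = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
buckets = buckshot(pts, 2)
```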

Slide 48: Example
[Figure: clustering of the sample, whose centroids become the initial centroids for K-Means]

Slide 49: More on Clustering
- Sound methods based on the document-to-document similarity matrix
  - Graph-theoretic methods
  - O(N²) time
- Iterative methods operating directly on the document vectors
  - O(N log N), O(N²/log N), or O(mN) time

Slide 50: Soft Clustering
- Hard clustering: each instance belongs to exactly one cluster
  - Does not allow for uncertainty
  - An instance may in fact belong to two or more clusters
- Soft clustering assigns probabilities that an instance belongs to each of a set of clusters
  - The probabilities over all clusters must sum to 1
- Expectation Maximization (EM) is the most popular approach

Slide 51: More Methods
- Connect two documents with an edge if their similarity exceeds a threshold T [Duda & Hart, 1973]
  - Clusters: the connected components (or maximal cliques) of the resulting graph
  - Problem: selection of an appropriate threshold T
- Zahn’s method [Zahn, 1971]

Slide 52: Zahn’s Method [Zahn, 1971]
1. Find the minimum spanning tree of the documents
2. For each node, delete incident edges with length l > l_avg, where l_avg is the average length of its incident edges
3. Clusters: the connected components of the remaining graph
[Figure: a minimum spanning tree in which the dashed (inconsistent) edge is deleted]
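The three steps can be sketched end to end. The `factor` threshold for calling an edge inconsistent is an assumption of this example (the slide compares an edge only against the average of its neighbours), and all names are illustrative:

```python
import math

def zahn_clusters(points, factor=2.0):
    """Zahn-style MST clustering: build a minimum spanning tree, delete
    edges much longer than their neighbours, return connected components."""
    n = len(points)
    dist = lambda i, j: math.dist(points[i], points[j])
    # Prim's algorithm for the MST
    in_tree, mst = {0}, []
    while len(in_tree) < n:
        w, i, j = min((dist(i, j), i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        in_tree.add(j)
        mst.append((w, i, j))
    # an edge is inconsistent if it is `factor` times longer than the
    # average of the other MST edges incident on its endpoints
    keep = []
    for w, i, j in mst:
        near = [w2 for (w2, a, b) in mst
                if (a, b) != (i, j) and {a, b} & {i, j}]
        if not near or w <= factor * (sum(near) / len(near)):
            keep.append((i, j))
    # connected components of the surviving edges (union-find)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in keep:
        parent[find(i)] = find(j)
    comps = {}
    for v in range(n):
        comps.setdefault(find(v), []).append(v)
    return sorted(comps.values())

# Two tight pairs joined by one long (inconsistent) MST edge.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
components = zahn_clusters(pts)
```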

Slide 53: References
- C. Faloutsos, "Searching Multimedia Databases by Content", Kluwer Academic Publishers, 1996
- M. Steinbach, G. Karypis, V. Kumar, “A Comparison of Document Clustering Techniques”, KDD Workshop on Text Mining, 2000
- A.K. Jain, M.N. Murty, P.J. Flynn, “Data Clustering: A Review”, ACM Computing Surveys, Vol. 31, No. 3, Sept. 1999
- A.K. Jain, R.C. Dubes, “Algorithms for Clustering Data”, Prentice-Hall, 1988
- G. Salton, “Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer”, Addison-Wesley, 1989