Clustering Basic Concepts and Algorithms


Clustering Basic Concepts and Algorithms Bamshad Mobasher DePaul University

What is Clustering in Data Mining?
Clustering is the process of partitioning a set of data (or objects) into meaningful sub-classes, called clusters.
It helps users understand the natural grouping or structure in a data set.
Cluster: a collection of data objects that are "similar" to one another and thus can be treated collectively as one group, yet as a collection are sufficiently different from other groups.
Clustering is unsupervised classification: there are no predefined classes.

Applications of Cluster Analysis
Data reduction
  Summarization: preprocessing for regression, PCA, classification, and association analysis
  Compression: image processing, e.g., vector quantization
Hypothesis generation and testing
Prediction based on groups: cluster, then find characteristics/patterns for each group
Finding k-nearest neighbors: localizing search to one or a small number of clusters
Outlier detection: outliers are often viewed as those "far away" from any cluster

Basic Steps to Develop a Clustering Task
Feature selection / preprocessing: select the information relevant to the task of interest, keep information redundancy minimal, and normalize/standardize if needed
Distance/similarity measure: how the similarity of two feature vectors is computed
Clustering criterion: expressed via a cost function or some rules
Clustering algorithm: choice of algorithm
Validation of the results
Interpretation of the results with applications
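
The sketch below walks through these steps end to end. It is only an illustration: it assumes numeric feature vectors and uses scikit-learn's StandardScaler, KMeans, and silhouette_score as stand-ins for the preprocessing, algorithm, and validation choices, none of which are prescribed by the slide itself.

```python
# Minimal sketch of the clustering workflow: preprocessing, clustering, validation.
# Assumes numeric feature vectors; scikit-learn is used for illustration only.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(100, 5)                    # toy data: 100 objects, 5 features

X_std = StandardScaler().fit_transform(X)     # normalization / standardization

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)  # clustering algorithm
labels = kmeans.fit_predict(X_std)

score = silhouette_score(X_std, labels)       # one possible validation criterion
print("silhouette score:", score)
```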

Distance or Similarity Measures
Common distance measures for two feature vectors x and y:
Manhattan distance: dist(x, y) = Σi |xi − yi|
Euclidean distance: dist(x, y) = √( Σi (xi − yi)² )
Cosine similarity: sim(x, y) = (x · y) / (‖x‖ ‖y‖)
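
As a concrete illustration of these measures, here is a minimal NumPy sketch with two toy vectors; the function names are ours, not from any particular library.

```python
# Plain NumPy versions of the distance/similarity measures above (illustrative sketch).
import numpy as np

def manhattan(x, y):
    return np.sum(np.abs(x - y))            # sum of absolute coordinate differences

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))    # straight-line distance

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.0, 1.0, 3.0])
print(manhattan(x, y), euclidean(x, y), cosine_similarity(x, y))
```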

More Similarity Measures
In the vector-space model, many similarity measures can be used in clustering:
Dice's coefficient
Simple matching
Cosine coefficient
Jaccard's coefficient
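
A minimal sketch of the set-based coefficients for binary (presence/absence) vectors, using their standard definitions; the vectors and helper names below are illustrative only.

```python
# Set-based similarity coefficients for binary (0/1) vectors -- illustrative sketch
# using the standard definitions of Dice, Jaccard, and simple matching.
import numpy as np

def dice(x, y):
    both = np.sum((x == 1) & (y == 1))          # features present in both vectors
    return 2 * both / (np.sum(x) + np.sum(y))

def jaccard(x, y):
    both = np.sum((x == 1) & (y == 1))
    either = np.sum((x == 1) | (y == 1))        # features present in at least one
    return both / either

def simple_matching(x, y):
    return np.mean(x == y)                      # fraction of positions that agree

x = np.array([1, 0, 1, 1, 0])
y = np.array([1, 1, 1, 0, 0])
print(dice(x, y), jaccard(x, y), simple_matching(x, y))
```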

Quality: What Is Good Clustering?
A good clustering method will produce high-quality clusters:
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on:
the similarity measure used and its implementation, and
its ability to discover some or all of the hidden patterns

Major Clustering Approaches
Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors. Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion. Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
Density-based approach: based on connectivity and density functions. Typical methods: DBSCAN, OPTICS, DenClue
Model-based approach: a model is hypothesized for each cluster, and the algorithm tries to find the best fit of the data to those models. Typical methods: EM, SOM, COBWEB
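
To make the categorization concrete, the sketch below runs one representative algorithm per family using scikit-learn (an assumed library choice; the slide itself only names the methods). GaussianMixture stands in for the EM-style model-based approach.

```python
# One representative algorithm per family, using scikit-learn for illustration only.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

X = np.random.rand(200, 2)                                              # toy data

partitioning = KMeans(n_clusters=3, n_init=10).fit_predict(X)           # partitioning
hierarchical = AgglomerativeClustering(n_clusters=3).fit_predict(X)     # hierarchical
density      = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)            # density-based
model_based  = GaussianMixture(n_components=3).fit_predict(X)           # model-based (EM)
```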

Partitioning Approaches
The notion of comparing item similarities can be extended to clusters themselves by focusing on a representative vector for each cluster:
cluster representatives can be actual items in the cluster or other "virtual" representatives such as the centroid
this reduces the number of similarity computations needed during clustering
clusters are revised successively until a stopping condition is satisfied, or until no more changes to clusters can be made
Reallocation-based partitioning methods start with an initial assignment of items to clusters and then move items from cluster to cluster to obtain an improved partitioning.
The most common algorithm of this type is k-means.

The K-Means Clustering Method
Given the number of desired clusters k, the k-means algorithm follows four steps:
1. Randomly assign objects to create k nonempty initial partitions (clusters)
2. Compute the centroid of each cluster in the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest centroid (reallocation step)
4. Go back to Step 2; stop when the assignment no longer changes
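
A minimal NumPy sketch of these four steps, assuming numeric data and Euclidean distance; edge cases such as empty clusters are deliberately not handled.

```python
# Minimal k-means sketch following the four steps above.
# Assumes numeric data and Euclidean distance; empty clusters are not handled.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random initial assignment of objects to k clusters
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Step 2: compute the centroid (mean point) of each cluster
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: reassign each object to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when the assignment no longer changes
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```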

K-Means Example: Document Clustering
Initial (arbitrary) assignment: C1 = {D1,D2}, C2 = {D3,D4}, C3 = {D5,D6}
Compute the cluster centroids, then compute the similarity (or distance) of each item to each cluster, resulting in a cluster-document similarity matrix (here the dot product is used as the similarity measure).

Example (Continued)
For each document, reallocate the document to the cluster to which it has the highest similarity (shown in red in the cluster-document similarity table). After the reallocation we have the following new clusters; note that the previously unassigned D7 and D8 have now been assigned, and that D1 and D6 have been reallocated from their original clusters:
C1 = {D2,D7,D8}, C2 = {D1,D3,D4,D6}, C3 = {D5}
This is the end of the first iteration (i.e., the first reallocation). Next, we repeat the process for another reallocation.

Example (Continued)
C1 = {D2,D7,D8}, C2 = {D1,D3,D4,D6}, C3 = {D5}
Now compute new cluster centroids using the original document-term matrix. This leads to a new cluster-document similarity matrix like the one on the previous slide. Again, the items are reallocated to the clusters with the highest similarity, giving the new assignment:
C1 = {D2,D6,D8}, C2 = {D1,D3,D4}, C3 = {D5,D7}
Note: this process is now repeated with the new clusters. However, the next iteration in this example will show no change to the clusters, thus terminating the algorithm.
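
The mechanics of one reallocation step can be sketched as follows. The document-term matrix below is hypothetical (the matrix used in the slides is not reproduced here), so the resulting clusters will not match the D1–D8 assignments above; the point is only to show the centroid computation, the dot-product similarity matrix, and the reallocation by highest similarity.

```python
# Illustrates one k-means reallocation step with dot-product similarity.
# The document-term matrix below is hypothetical (the original matrix from the
# slides is not shown), so the resulting clusters will differ from the example.
import numpy as np

docs = np.array([            # rows = documents D1..D6, columns = made-up term counts
    [2, 0, 1, 0],
    [0, 3, 0, 1],
    [1, 0, 2, 0],
    [0, 1, 0, 2],
    [3, 0, 0, 1],
    [0, 2, 1, 0],
], dtype=float)

clusters = {0: [0, 1], 1: [2, 3], 2: [4, 5]}    # initial arbitrary assignment

# Compute cluster centroids from the current assignment
centroids = np.array([docs[idx].mean(axis=0) for idx in clusters.values()])

# Cluster-document similarity matrix using the dot product as the similarity measure
sim = docs @ centroids.T                        # shape: (n_docs, n_clusters)

# Reallocate each document to the cluster with the highest similarity
new_assignment = sim.argmax(axis=1)
print(new_assignment)
```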

K-Means Algorithm
Strengths of k-means:
Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n
Weaknesses of k-means:
Often terminates at a local optimum
Applicable only when a mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Variations of k-means usually differ in:
Selection of the initial k means
The distance or similarity measures used
Strategies to calculate cluster means

A Disk Version of k-means
k-means can be implemented with the data residing on disk
In each iteration, it scans the database once
The centroids are computed incrementally
It can be used to cluster large datasets that do not fit in main memory
We need to control the number of iterations; in practice, a limit is set (e.g., < 50 iterations)
There are better algorithms that scale up for large data sets, e.g., BIRCH
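
One practical way to approximate this idea is to feed the data to the algorithm in chunks and update the centroids incrementally. The sketch below uses scikit-learn's MiniBatchKMeans with partial_fit as an illustration; it is not the exact disk-based k-means the slide describes.

```python
# Sketch of clustering data in chunks, so the full dataset never has to sit in
# memory at once. MiniBatchKMeans.partial_fit updates the centroids incrementally;
# this illustrates the idea, not the exact disk-based k-means described above.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=3, random_state=0)

def read_chunks():
    # stand-in for scanning the database from disk, one chunk per iteration
    for _ in range(20):
        yield np.random.rand(1000, 5)

for chunk in read_chunks():
    mbk.partial_fit(chunk)              # centroids updated incrementally per chunk

print(mbk.cluster_centers_)
```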

BIRCH
Designed for very large data sets, where time and memory are limited
Incremental and dynamic clustering of incoming objects
Only one scan of the data is necessary
Does not need the whole data set in advance
Two key phases:
1. Scan the database to build an in-memory tree
2. Apply a clustering algorithm to cluster the leaf nodes
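
For reference, a minimal example using the BIRCH implementation in scikit-learn; the parameter values are arbitrary and chosen only for illustration.

```python
# BIRCH as implemented in scikit-learn (illustrative; parameter values are arbitrary).
# Phase 1 builds the in-memory clustering-feature tree; the final step groups the
# leaf entries into n_clusters clusters.
import numpy as np
from sklearn.cluster import Birch

X = np.random.rand(10000, 4)            # toy "large" data set

birch = Birch(threshold=0.3, branching_factor=50, n_clusters=5)
labels = birch.fit_predict(X)
```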

Hierarchical Clustering Algorithms
Two main types of hierarchical clustering:
Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
Divisive: start with one all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters)
Traditional hierarchical algorithms use a similarity or distance matrix, and merge or split one cluster at a time.

Hierarchical Clustering Algorithms
Use the distance/similarity matrix as the clustering criterion
Does not require the number of clusters as input, but needs a termination condition
[Figure: agglomerative clustering merges objects a, b, c, d, e step by step into {ab}, {cd}, {cde}, and finally {abcde}; divisive clustering performs the same steps in reverse]

Hierarchical Agglomerative Clustering
Basic procedure:
1. Place each of the N items into a cluster of its own.
2. Compute all pairwise item-item similarity coefficients (a total of N(N-1)/2 coefficients).
3. Form a new cluster by combining the most similar pair of current clusters i and j (methods for determining which clusters to merge: single link, complete link, group average, etc.); update the similarity matrix by deleting the rows and columns corresponding to i and j and calculating the entries in the row corresponding to the new cluster i+j.
4. Repeat Step 3 while the number of clusters left is greater than 1.
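
In practice this procedure is usually delegated to a library. The sketch below uses SciPy's linkage and fcluster as an assumed implementation: linkage starts from singleton clusters and records each merge, and fcluster cuts the resulting tree into a chosen number of clusters.

```python
# Hierarchical agglomerative clustering via SciPy (illustrative sketch).
# linkage() starts from singleton clusters and repeatedly merges the most similar
# pair; the method argument selects single-link, complete-link, average, etc.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(8, 2)                            # 8 items, 2 features

Z = linkage(X, method='average')                    # full merge history (dendrogram data)
labels = fcluster(Z, t=3, criterion='maxclust')     # cut the tree into 3 clusters
print(labels)
```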

Hierarchical Agglomerative Clustering: Example
[Figure: six points grouped into nested clusters, with the corresponding dendrogram]

Distance Between Two Clusters
The basic procedure varies based on the method used to determine inter-cluster distances or similarities. Different methods result in different variants of the algorithm:
Single link
Complete link
Average link
Ward's method
etc.

Single Link Method
The distance between two clusters is the distance between the two closest data points in the two clusters, one data point from each cluster.
It can find arbitrarily shaped clusters, but it may cause the undesirable "chain effect" due to noisy points.
[Figure: two natural clusters are split incorrectly because of the chain effect]

Distance between two clusters
The single-link distance between clusters Ci and Cj is the minimum distance between any object in Ci and any object in Cj, i.e., the distance is defined by the two most similar objects:
dist(Ci, Cj) = min { d(x, y) : x ∈ Ci, y ∈ Cj }

Complete Link Method
The distance between two clusters is the distance between the two furthest data points in the two clusters.
It is sensitive to outliers because they are far away.

Distance between two clusters
The complete-link distance between clusters Ci and Cj is the maximum distance between any object in Ci and any object in Cj, i.e., the distance is defined by the two least similar objects:
dist(Ci, Cj) = max { d(x, y) : x ∈ Ci, y ∈ Cj }

Average Link and Centroid Methods
Average link: a compromise between the sensitivity of complete-link clustering to outliers and the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects. In this method, the distance between two clusters is the average of all pairwise distances between the data points in the two clusters.
Centroid method: in this method, the distance between two clusters is the distance between their centroids.

Distance between two clusters
The group-average distance between clusters Ci and Cj is the average distance between objects in Ci and objects in Cj, i.e., the distance is defined by the average over all pairwise distances:
dist(Ci, Cj) = (1 / (|Ci| |Cj|)) Σ d(x, y), summed over x ∈ Ci, y ∈ Cj
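
The three linkage definitions can be computed directly from the pairwise distance matrix of two clusters, as in this small NumPy sketch; the points are made up for illustration.

```python
# Direct NumPy versions of the three inter-cluster distances discussed above
# (single link = minimum, complete link = maximum, group average = mean of all
# pairwise distances). Illustrative sketch for two small clusters.
import numpy as np

def pairwise_distances(A, B):
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

A = np.array([[0.0, 0.0], [1.0, 0.0]])      # cluster Ci
B = np.array([[4.0, 0.0], [5.0, 1.0]])      # cluster Cj

D = pairwise_distances(A, B)
print("single link  :", D.min())    # two most similar objects
print("complete link:", D.max())    # two least similar objects
print("group average:", D.mean())   # average over all pairs
```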
