Data Mining Comp. Sc. and Inf. Mgmt. Asian Institute of Technology


Data Mining
Comp. Sc. and Inf. Mgmt., Asian Institute of Technology
Instructor: Prof. Sumanta Guha
Slide sources: Han & Kamber, "Data Mining: Concepts and Techniques" (book); slides by Han, Han & Kamber, adapted and supplemented by Guha

Chapter 7: Cluster Analysis

What is Cluster Analysis?
- Cluster: a collection of data objects
  - Similar to one another within the same cluster
  - Dissimilar to the objects in other clusters
- Cluster analysis: finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
- Unsupervised learning: no predefined classes
- Typical applications
  - As a stand-alone tool to get insight into data distribution
  - As a preprocessing step for other algorithms

Clustering: Rich Applications and Multidisciplinary Efforts
- Pattern recognition
- Spatial data mining
  - Create thematic maps in GIS by clustering feature spaces
  - Detect spatial clusters, or support other spatial mining tasks
- Image processing
- Economic science
  - Market research
- WWW
  - Document classification
  - Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
  - Fraud detection: outliers!
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continental faults

Quality: What Is Good Clustering?
- A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric: d(i, j)
- There is a separate "quality" function that measures the "goodness" of a cluster.
- The definitions of distance functions are usually very different for numeric, boolean, categorical and ordinal variables.
  - Numeric: income, temperature, price, etc.
  - Boolean: yes/no, e.g., student? citizen?
  - Categorical: color (red, blue, green, …), nationality, etc.
  - Ordinal: excellent/very good/…, high/medium/low (i.e., with order)
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.
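To make the point about type-specific distance functions concrete, here is a minimal sketch with one dissimilarity per variable type. The function names, the range scaling for numeric values and the rank scaling for ordinal values are illustrative assumptions following common textbook conventions; they are not taken from the slides.

```python
def numeric_dist(x, y, lo, hi):
    """Absolute difference scaled to [0, 1] by the attribute's range (assumed scaling)."""
    return abs(x - y) / (hi - lo)

def boolean_dist(x, y):
    """0 if the two yes/no values agree, 1 otherwise."""
    return 0.0 if x == y else 1.0

def categorical_dist(x, y):
    """Simple matching: 0 for the same category, 1 for different categories."""
    return 0.0 if x == y else 1.0

def ordinal_dist(x, y, levels):
    """Map the ordered levels to ranks 0..M-1, then compare the scaled ranks."""
    rank = {level: i for i, level in enumerate(levels)}
    return abs(rank[x] - rank[y]) / (len(levels) - 1)

# Hypothetical attribute values, for illustration only.
print(numeric_dist(30_000, 50_000, lo=0, hi=100_000))           # income
print(boolean_dist(True, False))                                # student?
print(categorical_dist("red", "red"))                           # color
print(ordinal_dist("low", "high", ["low", "medium", "high"]))   # rating
```

Each function returns a value in [0, 1], so the per-attribute results could be averaged into a single d(i, j) for mixed-type records; that combination step is one possible design, not something the slides prescribe.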

Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Ability to handle dynamic data
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

Major Clustering Approaches
- Partitioning approach: given n objects in the database, a partitioning approach splits them into k groups.
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using one of two methods:
  - Agglomerative (bottom-up): start with each object as a separate group; successively merge groups that are close until a termination condition holds.
  - Divisive (top-down): start with all objects in one group; successively split groups that are not "tight" until a termination condition holds.
  - Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
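The agglomerative (bottom-up) idea can be sketched in a few lines. This is a minimal illustration on one-dimensional data. It assumes single-link distance (closest pair of members) as the notion of "close" and "stop when k groups remain" as the termination condition; the slides leave both choices open.

```python
def agglomerative(points, k):
    """Bottom-up clustering of numbers on a line (single link, stop at k groups)."""
    clusters = [[p] for p in points]      # start: every object is its own group

    def link(a, b):
        # Single-link distance: distance between the closest pair of members.
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > k:              # termination condition (assumed)
        # Find the pair of groups that are closest to each other.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]   # merge group j into group i
        del clusters[j]
    return clusters

# Hypothetical data, for illustration only.
print(agglomerative([1, 2, 3, 10, 11, 20], k=3))
```

A divisive method would run the same loop in reverse: start from one group and repeatedly split the least "tight" one.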

Partitioning Algorithms: Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters $K_1, \dots, K_k$ so as to minimize the sum of squared errors
  $SSE = \sum_{m=1}^{k} \sum_{t \in K_m} d(t, c_m)^2$
  where $c_m$ is the cluster leader or representative of cluster $K_m$ (which itself may or may not belong to the database D).
- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: k-means and k-medoids algorithms
    - k-means (MacQueen'67): each cluster is represented by the center of the cluster
    - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster
- If there were no restriction on the number k of clusters, how could we minimize SSE?

1. The K-Means Clustering Method
Algorithm: k-means. The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.
Input: k: the number of clusters; D: a data set containing n objects.
Output: a set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3)   assign each object to the cluster whose center is closest to the object;
(4)   update the cluster centers as the mean value of the objects in each cluster;
(5) until no change;
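The five steps of the method translate almost line by line into code. The sketch below is one possible rendering under stated assumptions: objects are tuples of numbers, "closest" means smallest squared Euclidean distance, and an empty cluster keeps its previous center. The names `k_means`, `sse`, `seed` and `max_iter` are illustrative choices, not part of the slide.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def k_means(data, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(data, k)                      # (1) arbitrary initial centers
    assignment = None
    for _ in range(max_iter):                          # (2) repeat
        new_assignment = [                             # (3) assign to closest center
            min(range(k), key=lambda m: sq_dist(p, centers[m])) for p in data
        ]
        if new_assignment == assignment:               # (5) until no change
            break
        assignment = new_assignment
        for m in range(k):                             # (4) update centers to means
            members = [p for p, a in zip(data, assignment) if a == m]
            if members:                                # assumption: empty cluster keeps its center
                centers[m] = mean(members)
    return centers, assignment

def sse(data, centers, assignment):
    """Sum of squared errors of the objects to their cluster centers."""
    return sum(sq_dist(p, centers[a]) for p, a in zip(data, assignment))

# Hypothetical data: two well-separated groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignment = k_means(data, k=2)
print(centers, assignment, sse(data, centers, assignment))
```

The `max_iter` cap is only a safety net; on this small data set the loop stops as soon as the assignment repeats.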

The K-Means Clustering Method: Example (K = 2)
[Figure: sequence of scatter plots on axes from 0 to 10 showing successive iterations]
- Arbitrarily choose K objects as the initial cluster centers
- Assign each object to the most similar center
- Update the cluster means
- Reassign
- Update the cluster means
- Reassign

K-Means Clustering Method: Example with 8 points on a line and k = 2
[Figure: number line with the marked values 1, 2, 2.5, 3, 3.6, 4, 5.3, 8, 9, 9.5, 10, 11]
- Randomly choose 2 objects as cluster leaders.
- Cluster to nearest cluster leader.
- Compute new cluster leaders = cluster means.
- No change = exit!
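The figure for this example did not survive, but the values on the number line suggest a reconstruction. The trace below assumes that the eight points are 1, 2, 3, 4, 8, 9, 10, 11 and that the two randomly chosen leaders are 10 and 11. Both are inferences from the listed values, not facts stated on the slide. Under these assumptions the successive cluster means come out as 5.3, 3.6, 2.5 and 9.5 (rounded), which are exactly the non-integer values that appear on the number line.

```python
def k_means_1d(points, leaders):
    """Trace k-means on a line; returns the leaders of every round and the final clusters.

    Assumes no cluster ever becomes empty (true for the inputs used below).
    """
    history = [tuple(leaders)]
    while True:
        clusters = [[] for _ in leaders]
        for p in points:
            # Cluster to nearest cluster leader.
            nearest = min(range(len(leaders)), key=lambda m: abs(p - leaders[m]))
            clusters[nearest].append(p)
        # Compute new cluster leaders = cluster means.
        new_leaders = [sum(c) / len(c) for c in clusters]
        if new_leaders == leaders:        # no change = exit
            return history, clusters
        leaders = new_leaders
        history.append(tuple(leaders))

points = [1, 2, 3, 4, 8, 9, 10, 11]       # assumed data points
history, clusters = k_means_1d(points, leaders=[10, 11])   # assumed initial leaders
for leaders in history:
    print([round(c, 1) for c in leaders])
print(clusters)
```

The run ends with the clusters {1, 2, 3, 4} and {8, 9, 10, 11}, whose means are 2.5 and 9.5.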

Comments on the K-Means Method
- Strength: relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
  - For comparison: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k)), where s is the sample size
- Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing.
- Weaknesses
  - Applicable only when the mean is defined; what about categorical data?
  - Need to specify k, the number of clusters, in advance
  - Unable to handle noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes

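How to choose k, the number of clusters?
- Rule of thumb: k ≈ √(n/2)
- Elbow method: plot SSE vs. number of clusters. Choose k where the SSE starts leveling off, i.e., from where there is not much gain in adding another cluster.
  $SSE = \sum_{m=1}^{k} \sum_{t \in K_m} d(t, c_m)^2$, where $c_m$ is the leader of cluster $K_m$
[Figure: SSE plotted against the number of clusters; in the plotted example the choice would be k = 6. Figure credit: Daniel Martin, Quora]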
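Both heuristics are easy to try on a toy data set. The sketch below computes the rule-of-thumb value and an SSE curve for k = 1..5 on hypothetical one-dimensional data. Two things are assumptions made for illustration: the deterministic choice of initial centers (evenly spaced data points, so the run is repeatable) and the `elbow` rule, which picks the first k after which the SSE gain falls below 5% of the SSE at k = 1. The slides only say to read the elbow off the plot.

```python
import math

def k_means_1d_sse(points, k):
    """Run k-means on numbers on a line and return the final SSE."""
    pts = sorted(points)
    n = len(pts)
    # Assumed deterministic start: k evenly spaced data points.
    centers = [float(pts[(2 * i + 1) * n // (2 * k)]) for i in range(k)]
    while True:
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda m: abs(p - centers[m]))
            clusters[nearest].append(p)
        new_centers = [sum(c) / len(c) if c else centers[m]
                       for m, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sum((p - centers[m]) ** 2 for m, c in enumerate(clusters) for p in c)

def elbow(sse_by_k, threshold=0.05):
    """First k whose next SSE gain is below threshold * SSE at the smallest k (assumed rule)."""
    ks = sorted(sse_by_k)
    base = sse_by_k[ks[0]]
    for k, k_next in zip(ks, ks[1:]):
        if sse_by_k[k] - sse_by_k[k_next] < threshold * base:
            return k
    return ks[-1]

# Hypothetical data: three obvious groups on a line.
points = [1, 2, 3, 10, 11, 12, 20, 21, 22]
rule_of_thumb = round(math.sqrt(len(points) / 2))
sse_by_k = {k: k_means_1d_sse(points, k) for k in range(1, 6)}
for k, value in sse_by_k.items():
    print(k, round(value, 2))
print("rule of thumb:", rule_of_thumb, "elbow:", elbow(sse_by_k))
```

On this data the curve drops sharply up to k = 3 and is nearly flat afterwards, so the elbow rule picks 3, while the rule of thumb gives 2. The two heuristics need not agree.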

Variations of the K-Means Method
- A few variants of k-means, which differ in
  - Selection of the initial k means
  - Dissimilarity calculations
  - Strategies to calculate cluster means
- Handling categorical data: k-modes (Huang'98)
  - Replacing means of clusters with modes
  - Using new dissimilarity measures to deal with categorical objects
  - Using a frequency-based method to update modes of clusters
- A mixture of categorical and numerical data: k-prototype method
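The two ingredients that k-modes swaps into k-means can be sketched as follows: a simple-matching dissimilarity for categorical records and a frequency-based mode in place of the mean. The function names and the sample records are hypothetical, and ties between equally frequent values are broken here by first occurrence, which is an implementation choice.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Attribute-wise most frequent value; replaces the cluster mean in k-modes."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

# Hypothetical cluster of (color, size) records.
cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
mode = cluster_mode(cluster)
print(mode)
print(matching_dissimilarity(("blue", "large"), mode))
```

Plugging these two functions into the assign/update loop of k-means, in place of the distance and the mean, gives the basic k-modes iteration.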

What Is the Problem with the K-Means Method?
- The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
- K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.
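A small numeric experiment shows the effect. The sketch below compares the mean and the medoid of a one-dimensional cluster before and after adding one extreme value. The data are hypothetical, and the medoid is taken to be the member with the smallest total absolute distance to all other members.

```python
def mean(values):
    return sum(values) / len(values)

def medoid(values):
    """Member with the smallest total distance to all members of the cluster."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))

cluster = [1, 2, 3, 4]                 # hypothetical cluster
with_outlier = cluster + [100]         # same cluster plus one extreme value

print(mean(cluster), medoid(cluster))
print(mean(with_outlier), medoid(with_outlier))
```

The outlier drags the mean from 2.5 to 22.0, far from every ordinary member, while the medoid stays inside the original group of points.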

More Partitioning Clustering Algorithms
- K-medoids (PAM = Partitioning Around Medoids)
- CLARA (Clustering LARge Applications)
- CLARANS (Clustering Large Applications based on RANdomized Search)
Read about the above three clustering methods in the paper "Efficient and Effective Clustering Methods for Spatial Data Mining", by Ng and Han, Intl. Conf. on Very Large Data Bases (VLDB'94), 1994, which proposes CLARANS but has a good presentation of PAM and CLARA as well.

Hierarchical Clustering Algorithms: ROCK
"ROCK: A Robust Clustering Algorithm for Categorical Data", by (Sudipto) Guha, Rastogi and Shim, Information Systems, 2000.
- Main slides: ROCK slides by the authors
- Related slides: ROCK slides by Olusegun et al.

Density-Based Clustering Algorithms: DBSCAN
"A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", by Ester, Kriegel, Sander and Xu, Intl. Conf. on Knowledge Discovery and Data Mining (KDD'96), 1996.

DBSCAN Example 1
What are the clusters if Eps = 2, MinPts = 7? Use Manhattan distance!

DBSCAN Example 2
What are the clusters if Eps = 1, MinPts = 4? Use Manhattan distance!
Note: The middle point must belong to both clusters if we follow the definition. So clusters can overlap!
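The overlap mentioned in the note can be reproduced with a small sketch that follows the definition directly: a cluster is the set of all points density-reachable from a core point, and border points are included but never expanded. The point layout below is hypothetical (the slide's figure is not available): two plus-shaped groups that share one arm. The Eps-neighborhood is taken to include the point itself, as in the DBSCAN paper.

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def dbscan_by_definition(points, eps, min_pts):
    """Clusters as sets of points density-reachable from core points; may overlap."""
    neighbors = {p: [q for q in points if manhattan(p, q) <= eps] for p in points}
    core = {p for p in points if len(neighbors[p]) >= min_pts}
    clusters, seen_core = [], set()
    for start in points:
        if start not in core or start in seen_core:
            continue
        cluster, frontier = {start}, [start]
        while frontier:
            p = frontier.pop()
            if p not in core:
                continue                   # border points join but are not expanded
            seen_core.add(p)
            for q in neighbors[p]:
                if q not in cluster:
                    cluster.add(q)
                    frontier.append(q)
        clusters.append(cluster)
    noise = set(points) - set().union(*clusters)
    return clusters, noise

# Hypothetical layout: two plus shapes centered at (1, 1) and (3, 1) sharing (2, 1).
points = [(1, 1), (0, 1), (1, 0), (1, 2), (2, 1),
          (3, 1), (3, 0), (3, 2), (4, 1), (6, 6)]
clusters, noise = dbscan_by_definition(points, eps=1, min_pts=4)
for c in clusters:
    print(sorted(c))
print("noise:", noise)
```

Here only the two centers are core points. The shared arm (2, 1) has just three points in its neighborhood, so it is a border point reachable from both centers and appears in both clusters. Many practical implementations instead assign such a border point to whichever cluster reaches it first.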

Subspace Clustering Algorithms: CLIQUE
"Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications", by Agrawal, Gehrke, Gunopulos and Raghavan, ACM SIGMOD Intl. Conf. on Management of Data (SIGMOD'98), 1998.