CLUSTERING.


Introduction
A cluster is a collection of data objects that are similar to the other objects within the same cluster and dissimilar to the objects in other clusters. Clustering is used:
- in biology, to develop plant and animal taxonomies;
- in business, to discover distinct groups of customers and to characterize each group by its purchasing behaviour;
- in insurance, to identify groups of automobile policy holders;
- in city planning, to identify groups of houses according to house type, value, and geographical location;
- on the web, to classify documents for information discovery.

The basic requirements of cluster analysis are:
·       Dealing with different types of attributes: many clustering algorithms are designed for interval-scaled (numerical) data, while many applications require other types such as binary, nominal, ordinal, or mixed data.
·       Dealing with noisy data: real databases contain outliers and missing, unknown, or erroneous values, and many clustering algorithms are too sensitive to such data.
·       Constraint-based clustering: many applications need to perform clustering under various constraints. For example, to locate a household in a city you may need to cluster households while taking constraints such as street, area, or house number into account.

· Discovering clusters of arbitrary shape: many clustering algorithms determine clusters using Euclidean or Manhattan distance measures, and algorithms based on these distances tend to find spherical clusters of similar size and density. Real clusters do not all have the same shape, so algorithms that can detect clusters of arbitrary shape are needed.
·  High dimensionality: a database or data warehouse can contain many dimensions or attributes. Some clustering algorithms work well only on low-dimensional data with two or three dimensions; high-dimensional data requires different algorithms.
·   Insensitivity to the order of input data: some clustering algorithms are sensitive to the order of the input data; the same data set presented in a different order may produce different results.

· Interpretability and usability: the results of a clustering algorithm should be interpretable, comprehensible, and usable.
· Minimal requirements for input parameters: many clustering algorithms require the user to supply parameters at run time. The clustering result is often very sensitive to these parameters, which are hard to determine, especially for high-dimensional data.
· Scalability: many clustering algorithms work well on small data sets containing fewer than 250 objects, while a large database may contain millions of objects, so highly scalable clustering techniques are needed.

Types of Data
- Data matrix: the data is represented as a relational table, an n × p matrix, where n is the number of objects and p the number of variables.
- Dissimilarity matrix: the pairwise dissimilarities are represented as a relational table, an n × n matrix, where n is the number of objects.
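As a small illustration (not from the slides), the relationship between the two representations can be shown in a few lines of NumPy, assuming Euclidean distance as the dissimilarity measure:

```python
import numpy as np

# Data matrix: n objects (rows) described by p variables (columns).
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [5.0, 7.0]])          # n = 3, p = 2

n = X.shape[0]
# Dissimilarity matrix: n x n, symmetric, zeros on the diagonal.
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = np.linalg.norm(X[i] - X[j])   # Euclidean distance (one possible choice)
        D[i, j] = D[j, i] = d

print(D)
```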

Interval-Scaled Variables
Interval-scaled variables are continuous measurements on a linear scale, e.g. height and weight. Suppose you measure temperature in Celsius or Fahrenheit: changing the measurement unit can change the result of the cluster analysis. To avoid this dependency on the measurement unit, always use standardized, i.e. unit-less, data. There are two steps to convert the original measurements to unit-less variables:
1. Calculate the mean absolute deviation s_f using this formula:
   s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|)

where x_1f, …, x_nf are the n measurements of variable f and m_f is its mean value, the simple mean:
   m_f = (x_1f + x_2f + … + x_nf) / n
2. Calculate the standardized measurement (z-score) using the formula:
   z_if = (x_if − m_f) / s_f
where z_if is the standardized measurement. Both steps are sketched in code after the distance measures below.
After computing the z-scores, you can compute the dissimilarity between objects using one of these distance measures:
Euclidean distance: the geometric distance in multidimensional space.

d(i,j) = {n|Xi-Xj|q } 1/q Dissimilarity is calculated by, d(i,j) = {n(Xi-Xj)} 1/2 Manhattan Distance: average difference of the various dimension objects. Dissimilarity is calculated by, d(i,j) = n|Xi-Xj| Minkowski Distance: generalization of both Euclidean distance and Manhattan Distance Dissimilarity is calculated by, d(i,j) = {n|Xi-Xj|q } 1/q

Binary Variables
A binary variable has two states, 0 and 1: state 0 means the attribute is absent, state 1 means it is present. There are two types of binary variables: symmetric and asymmetric.
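As an illustration of how the two types are typically handled (an assumption on my part, not spelled out on the slides), symmetric binary variables are often compared with the simple matching coefficient and asymmetric ones with the Jaccard coefficient:

```python
def binary_dissimilarity(x, y, symmetric=True):
    """x, y are equal-length lists of 0/1 values.
    a = both 1, b = 1/0, c = 0/1, d = both 0."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    if symmetric:                     # simple matching: 0-0 matches count
        return (b + c) / (a + b + c + d)
    return (b + c) / (a + b + c)      # asymmetric (Jaccard): ignore 0-0 matches

print(binary_dissimilarity([1, 0, 1, 1], [1, 1, 0, 1], symmetric=False))  # 0.5
```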

- Nominal, ordinal, and ratio-scaled variables
- Variables of mixed types

Partitioning Methods
The k-means method:
1. Randomly select k objects; each initially represents a cluster mean (centre of gravity).
2. Assign each object to the cluster whose mean it is most similar to, based on the distance measure.
3. Update the mean of each cluster, if necessary.
4. Repeat until the criterion function converges:
   E = Σ_{i=1..k} Σ_{p ∈ C_i} |p − m_i|²
where m_i is the mean of cluster C_i.
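A minimal k-means sketch in Python, assuming NumPy; the value of k, the random initial seeds, and the convergence test are illustrative choices rather than part of the slides:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """X: (n, p) data matrix. Returns cluster labels and the k means."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # step 1: random seeds
    for _ in range(max_iter):
        # step 2: assign each object to the nearest cluster mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each cluster mean
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else means[i] for i in range(k)])
        # step 4: stop when the means (and hence the criterion) no longer change
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
print(k_means(X, k=2))
```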

The k-medoids method:
1. Randomly select k objects to serve as the initial reference points (medoids).
2. Assign each remaining object to the cluster whose medoid it is most similar to.
3. Randomly select a non-medoid object O_random.
4. Calculate the total cost C of swapping medoid O_i with the non-medoid object O_random.
5. If the total cost of swapping is negative, swap O_i with O_random to form the new set of medoids.
6. Repeat from step 2 until no swap occurs.
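A sketch of the swap-cost test at the heart of the k-medoids idea, assuming a precomputed dissimilarity matrix D (illustrative code, not a full PAM implementation):

```python
import numpy as np

def total_cost(D, medoid_idx):
    """Sum over all objects of the distance to their closest medoid."""
    return D[:, medoid_idx].min(axis=1).sum()

def try_swap(D, medoids, i, o_random):
    """Return the change in total cost if medoid i is replaced by o_random.
    A negative value means the swap improves the clustering and is kept."""
    new_medoids = [m for m in medoids if m != i] + [o_random]
    return total_cost(D, new_medoids) - total_cost(D, medoids)

# Five objects on a line at positions 0, 1, 5, 6, 7; current medoids {0, 2}.
D = np.array([[0, 1, 5, 6, 7],
              [1, 0, 4, 5, 6],
              [5, 4, 0, 1, 2],
              [6, 5, 1, 0, 1],
              [7, 6, 2, 1, 0]], dtype=float)
print(try_swap(D, [0, 2], i=2, o_random=3))   # negative, so the swap is accepted
```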

Hierarchical Methods
Agglomerative and Divisive Hierarchical Clustering
In agglomerative clustering, each object starts in its own cluster. The single-object clusters are merged into larger clusters, and merging continues until all objects end up in one big cluster containing every object. In divisive hierarchical clustering, all objects start in one big cluster, which is repeatedly split into smaller clusters until each cluster contains a single object.

[Figure: agglomerative vs. divisive hierarchical clustering of objects 1–6 across levels 0–5; agglomerative merging forms {1,2} and {4,5}, then {1,2,3} and {4,5,6}, then the full set, while divisive splitting performs the same steps in reverse order.]
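A minimal sketch of the agglomerative (bottom-up) process described above, assuming NumPy and single-link merging (the linkage choice is an illustrative assumption):

```python
import numpy as np

def agglomerative(X, target_k=1):
    """Start with every object in its own cluster and repeatedly merge the
    two closest clusters (single link) until target_k clusters remain."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > target_k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-link distance: closest pair of members
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]       # merge the two closest clusters
        del clusters[b]
    return clusters

X = np.array([[1.0], [1.2], [5.0], [5.1], [9.0], [9.3]])
print(agglomerative(X, target_k=3))     # [[0, 1], [2, 3], [4, 5]]
```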

Balanced Iterative Reducing and Clustering Using Hierarchies (BIRCH)
BIRCH is built on the clustering feature (CF) tree. A clustering feature summarizes a sub-cluster:
   CF = (N, LS, SS)
where N is the number of objects in the sub-cluster, LS is the linear sum of the N objects, and SS is the sum of squares of the N objects.
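A sketch of the clustering feature on its own (illustrative; the full CF tree with its branching and threshold parameters is not shown): two CFs can be merged by simple addition, and a sub-cluster's centroid and radius follow directly from (N, LS, SS):

```python
import numpy as np

class CF:
    """Clustering feature of a sub-cluster: CF = (N, LS, SS)."""
    def __init__(self, points):
        pts = np.asarray(points, dtype=float)
        self.N = len(pts)                 # number of objects
        self.LS = pts.sum(axis=0)         # linear sum of the objects
        self.SS = (pts ** 2).sum()        # sum of squared components

    def merge(self, other):
        """CFs are additive, so sub-clusters merge without re-reading the data."""
        merged = CF([])
        merged.N = self.N + other.N
        merged.LS = self.LS + other.LS
        merged.SS = self.SS + other.SS
        return merged

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # average distance of members to the centroid, derivable from (N, LS, SS)
        return np.sqrt(max(self.SS / self.N - (self.centroid() ** 2).sum(), 0.0))

a = CF([[1.0, 2.0], [2.0, 2.0]])
b = CF([[8.0, 9.0]])
print(a.merge(b).centroid())
```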

Working:
- BIRCH scans the data set to build an initial in-memory CF tree, which represents a multilevel compression of the data objects in the form of clustering features.
- BIRCH then applies a clustering algorithm to cluster the leaf nodes of the CF tree.
Limitation: it is suitable only for spherical clusters.

CURE (Clustering Using Representatives)
Working steps:
1. Draw a random sample of the objects contained in the data set.
2. Divide the sample into a fixed number of partitions.
3. Partially cluster each partition independently.
4. Remove outliers by random sampling: if a cluster grows too slowly, delete it from its partition.
5. Cluster the partial clusters from all partitions using the shrinking factor: all representative points of a cluster are moved towards the cluster centre by a user-specified fraction, the shrinking factor.

6. Label the formed clusters.
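A sketch of the shrinking step alone (illustrative; alpha stands for the user-defined shrinking factor mentioned above):

```python
import numpy as np

def shrink_representatives(rep_points, alpha=0.3):
    """Move each representative point toward the cluster centre by the
    fraction alpha (the shrinking factor), which damps the effect of outliers."""
    reps = np.asarray(rep_points, dtype=float)
    centre = reps.mean(axis=0)
    return reps + alpha * (centre - reps)

reps = [[0.0, 0.0], [10.0, 0.0], [5.0, 8.0]]
print(shrink_representatives(reps, alpha=0.3))
```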

Density-Based Methods
DBSCAN working steps:
1. Check the ε-neighbourhood of each data point in the data set.
2. Mark an object as a core object if its ε-neighbourhood contains at least MinPts data points.
3. Collect all the objects that are directly density-reachable from these core objects.
4. Merge objects that are density-connected to form clusters.
5. Terminate when no further data point can be added to any cluster.
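A compact DBSCAN sketch, assuming NumPy and Euclidean distance; eps and min_pts correspond to the ε-neighbourhood radius and MinPts above:

```python
import numpy as np

def dbscan(X, eps=1.0, min_pts=3):
    """Labels: -1 = noise, otherwise the cluster id of each object."""
    n = len(X)
    labels = np.full(n, -1)
    # ε-neighbourhood of every point, computed once up front
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.where(dists[i] <= eps)[0] for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbours[i]) < min_pts:
            continue                       # already clustered, or not a core object
        labels[i] = cluster
        frontier = list(neighbours[i])     # directly density-reachable points
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbours[j]) >= min_pts:     # j is itself a core object
                    frontier.extend(neighbours[j])    # keep expanding the cluster
        cluster += 1
    return labels

X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [20, 20]], dtype=float)
print(dbscan(X, eps=1.5, min_pts=2))   # e.g. [0 0 0 1 1 -1]; -1 marks noise
```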