Published by Benoît Laurent. Modified over 6 years ago.
CSCI N317 Computation for Scientific Applications Unit 3 - 4 Weka
Cluster Analysis
What is Cluster Analysis?
The process of grouping a set of physical or abstract objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Clustering is also called data segmentation in some applications, because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection, where outliers may be more interesting than common cases. It is a form of unsupervised learning: learning by observation rather than from labeled examples.
Examples of Clustering Applications
Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: identification of areas of similar land use in an earth observation database
Insurance: identifying groups of motor insurance policy holders with a high average claim cost
City planning: identifying groups of houses according to their house type, value, and geographical location
Types of Data in Cluster Analysis
Cluster analysis typically operates on either of the following two data structures. Data matrix: represents n objects (such as persons) with p variables (measurements or attributes), such as age, height, weight, and so on. The structure is in the form of an n-by-p matrix.
Dissimilarity matrix: stores a collection of proximities for all pairs of the n objects. The structure is in the form of an n-by-n matrix, where d(i,j) is the measured difference or dissimilarity between objects i and j. In general, d(i,j) is a nonnegative number that is close to 0 when objects i and j are highly similar or "near" each other, and becomes larger the more they differ. Two properties hold: d(i,j) = d(j,i), and d(i,i) = 0.
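As a small illustration (not Weka's implementation), the dissimilarity matrix can be built from a data matrix in a few lines of Python, here assuming Euclidean distance as the dissimilarity measure:

```python
import math

def dissimilarity_matrix(data):
    """Build the n-by-n dissimilarity matrix from an n-by-p data
    matrix (a list of equal-length numeric rows), using Euclidean
    distance as d(i, j)."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]   # d(i, i) = 0 on the diagonal
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(data[i], data[j])
            d[i][j] = d[j][i] = dist    # symmetry: d(i, j) = d(j, i)
    return d

# Three objects described by two attributes each (made-up values).
objects = [[1.0, 2.0], [1.0, 2.0], [4.0, 6.0]]
m = dissimilarity_matrix(objects)
print(m[0][1])  # 0.0 -- objects 0 and 1 are identical
print(m[0][2])  # 5.0 -- sqrt(3**2 + 4**2)
```

Because of the symmetry property d(i,j) = d(j,i), only the upper triangle actually needs to be computed; the lower triangle is filled in by mirroring.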
Many clustering algorithms operate on a dissimilarity matrix. If the data are presented in the form of a data matrix, they can first be transformed into a dissimilarity matrix before applying clustering algorithms. The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio-scaled, and vector variables. Different methods have been developed for computing distances between different types of variables and between objects with mixed-type variables.
Cautions for Interval-Scaled Variables
Interval-scaled variables are continuous measurements on a roughly linear scale, e.g., weight, height, latitude and longitude coordinates, and temperature. The measurement units used can affect the clustering analysis: for example, changing height measurements from meters to inches may lead to a very different clustering structure. Therefore the data may need to be standardized.
Standardize the data: convert the original measurements of a variable f to unitless values.
Calculate the mean absolute deviation: s_f = (1/n)(|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|), where x_1f, …, x_nf are the n measurements of f and m_f is the mean of f.
Calculate the standardized measurement, or z-score: z_if = (x_if − m_f) / s_f.
Whether and how to perform standardization is the choice of the user.
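The two formulas above can be checked with a short Python sketch (the height values are made up for illustration):

```python
def standardize(values):
    """Convert the raw measurements of one variable f into unitless
    z-scores, using the mean absolute deviation s_f (more robust to
    outliers than the standard deviation)."""
    n = len(values)
    m_f = sum(values) / n                         # mean of f
    s_f = sum(abs(x - m_f) for x in values) / n   # mean absolute deviation
    return [(x - m_f) / s_f for x in values]

heights_m = [1.60, 1.70, 1.80]               # heights in meters
heights_in = [h * 39.37 for h in heights_m]  # the same heights in inches
# The z-scores agree regardless of the unit, so clustering on
# standardized data no longer depends on the measurement unit chosen.
print(standardize(heights_m))
print(standardize(heights_in))
```

This is exactly the point of the caution on the previous slide: before standardization, switching from meters to inches rescales every distance; after standardization, both unit choices yield the same values.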
Variables of Mixed Types
Example: calculate the dissimilarity between objects described by variables of mixed types, using all of the variables together. After the distance computation, the resulting dissimilarity matrix shows that objects 1 and 4 are the most similar.
Clustering Methods Partitioning methods
Given a database of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. Each group must contain at least one object, and each object must belong to exactly one group. A partitioning method creates an initial partitioning and then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.
Clustering Methods Hierarchical methods
These methods create a hierarchical decomposition of the given set of data objects. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one, or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually each object forms its own cluster, or until a termination condition holds.
Clustering Methods The choice of clustering algorithm depends both on the type of data available and on the particular purpose of the application. If cluster analysis is used as a descriptive or exploratory tool, it is possible to try several algorithms on the same data to see what the data may disclose.
Partitioning Methods Given D, a data set of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster. In the k-means method, cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.
Partitioning Methods Example
Let k = 3. Arbitrarily choose three objects as the three initial cluster centers (in the figure, centers are marked by a "+"). Each object is distributed to the cluster whose center is nearest. The cluster centers are then updated by calculating the new mean of the objects currently in each cluster. Using the new cluster centers, the objects are redistributed to the clusters with the nearest centers. Eventually no redistribution of the objects in any cluster occurs, and the process terminates.
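The iteration described above can be sketched in plain Python. This is a simplified illustration of the algorithm, not the Weka or R implementation used in the course, and the sample points are made up (k = 2 here to keep the example small):

```python
import math
import random

def kmeans(data, k, max_iter=100, seed=0):
    """Minimal k-means: choose k initial centers arbitrarily, assign
    each object to its nearest center, recompute each center as the
    mean of its cluster, and repeat until no object changes cluster."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(data, k)]  # arbitrary initial centers
    assignment = None
    for _ in range(max_iter):
        # Distribute each object to the cluster with the nearest center.
        new_assignment = [min(range(k), key=lambda c: math.dist(x, centers[c]))
                          for x in data]
        if new_assignment == assignment:   # no redistribution -> terminate
            break
        assignment = new_assignment
        # Update each center to the mean of its current members.
        for c in range(k):
            members = [x for x, a in zip(data, assignment) if a == c]
            if members:                    # keep old center if cluster emptied
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5)]
centers, labels = kmeans(points, k=2)
print(labels)  # the two left points share one label, the two right points the other
```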
Partitioning Methods The k-means method can only be applied when the mean of a cluster is defined, and thus cannot be applied to data with categorical attributes. The k-modes method extends the k-means paradigm to cluster categorical data by replacing the means of clusters with modes, using new dissimilarity measures to deal with categorical objects and a frequency-based method to update the modes of clusters. The k-means and k-modes methods can be integrated to cluster data with mixed numeric and categorical values. The k-means method is sensitive to noise and outlier data points, because a small number of such data points can substantially influence the mean value.
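To make the modes-and-mismatch idea concrete, here is a small sketch of the two ingredients k-modes swaps in: a simple mismatch-count dissimilarity for categorical objects, and the per-attribute mode as the cluster "center" (the example values are invented):

```python
from collections import Counter

def mismatch_distance(x, y):
    """Simple dissimilarity for categorical objects: the number of
    attributes on which the two objects disagree."""
    return sum(a != b for a, b in zip(x, y))

def modes(cluster):
    """Cluster 'center' for categorical data: the most frequent value
    (mode) of each attribute across the cluster's members."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*cluster)]

cluster = [("red", "suv"), ("red", "sedan"), ("blue", "suv")]
print(modes(cluster))                                        # ['red', 'suv']
print(mismatch_distance(("red", "suv"), ("blue", "sedan")))  # 2
```

The full k-modes algorithm then runs the same assign/update loop as k-means, with these two functions in place of Euclidean distance and the arithmetic mean.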
Hierarchical Methods Group data objects into a tree of clusters.
Two types: Agglomerative – a bottom-up strategy: place each object in its own cluster and merge these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. Divisive – a top-down strategy: start with all objects in one cluster and subdivide it into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied, such as a desired number of clusters being obtained.
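The agglomerative (bottom-up) strategy can be sketched as a toy Python example, assuming single-link distance (the closest pair of members) between clusters and a desired number of clusters as the termination condition; the points are made up:

```python
import math

def single_link_agglomerative(data, num_clusters):
    """Bottom-up clustering sketch: start with each object in its own
    cluster, repeatedly merge the two closest clusters (single link:
    cluster distance = distance of the closest pair of members), and
    stop when the desired number of clusters remains."""
    clusters = [[p] for p in data]          # each object is its own cluster
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)      # merge the closest pair of clusters
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(single_link_agglomerative(points, 3))
```

Recording the sequence of merges (and the distance at each merge) is what produces the dendrogram, the tree of clusters mentioned above.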
Outlier Discovery What are outliers?
Outliers are objects that are considerably dissimilar from the remainder of the data. Example in sports: Michael Jordan, Wayne Gretzky, ... Problem: define and find outliers in large data sets. Applications: credit card fraud detection, telecom fraud detection, customer segmentation, medical analysis.
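One simple way to make "considerably dissimilar" operational is a distance-based test: flag an object whose average distance to all other objects is far above the overall average. This is only an illustrative sketch, and the threshold factor of 2 below is an arbitrary assumption, not a rule from the course:

```python
import math

def distance_outliers(data, factor=2.0):
    """Flag as outliers the objects whose average distance to all
    other objects exceeds `factor` times the overall average
    (a simple distance-based notion of 'considerably dissimilar')."""
    n = len(data)
    avg = [sum(math.dist(x, y) for y in data) / (n - 1) for x in data]
    overall = sum(avg) / n
    return [i for i, a in enumerate(avg) if a > factor * overall]

points = [(0, 0), (1, 0), (0, 1), (1, 1), (50, 50)]
print(distance_outliers(points))  # [4] -- only the far-away point is flagged
```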
Videos: Weka – https://www.youtube.com/watch?v=HCA0Z9kL7Hg
R: data site used in video. Data file on Canvas: iris.csv. Sample code on Canvas: clusterAnalysis.R