Clustering
Problem Formulation
Given a set of records, organize the records into clusters (classes).
Clustering: the process of grouping physical or abstract objects into classes of similar objects.
What is a cluster?
1. A cluster is a subset of records that are "similar".
2. A subset of records such that the distance between any two records in the cluster is less than the distance between any record in the cluster and any record not in it.
3. A connected region of a multidimensional space containing a relatively high density of records.
Another Formulation
Given points in some multidimensional space, group the points into a small number of clusters. The Skycat project clustered 2·10⁹ sky objects into stars, galaxies, quasars, etc. Each object was a point in a 7-dimensional space, with each dimension representing radiation in one band of the spectrum. Documents may be thought of as points in a high-dimensional space, where each dimension corresponds to one possible word. The position of a document in a dimension is the number of times the word occurs in the document. Clusters of documents correspond to groups of documents on the same topic.
Distance Measures
To discuss whether a set of points is close enough to be considered a cluster, we need a distance measure D(x, y). The usual axioms for a distance measure D are:
D(x, x) = 0
D(x, y) = D(y, x)
D(x, y) ≤ D(x, z) + D(z, y) (the triangle inequality)
Distance Measures
Assume a k-dimensional Euclidean space; the distance between two points x = [x1, x2, ..., xk] and y = [y1, y2, ..., yk] may be defined using one of the measures:
Euclidean distance ("L2 norm"): D(x, y) = √( Σi (xi − yi)² )
Manhattan distance ("L1 norm"): D(x, y) = Σi |xi − yi|
Max of dimensions ("L∞ norm"): D(x, y) = maxi |xi − yi|
Distance Measures
Minkowski distance ("Lp norm"): D(x, y) = ( Σi |xi − yi|^p )^(1/p)
When there is no Euclidean space in which to place the points, clustering becomes more difficult: Web page accesses, DNA sequences, customer sequences, categorical attributes, documents, etc.
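As an illustration, here is a minimal Python sketch of these measures; the function names euclidean, manhattan, max_of_dimensions, and minkowski are illustrative, not taken from any particular library.

```python
import math

def euclidean(x, y):
    # L2 norm: square root of the sum of squared differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # L1 norm: sum of absolute differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def max_of_dimensions(x, y):
    # L-infinity norm: largest absolute difference over all dimensions
    return max(abs(xi - yi) for xi, yi in zip(x, y))

def minkowski(x, y, p):
    # Lp norm: p = 1 gives Manhattan, p = 2 gives Euclidean
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

x, y = [2, 0, 3, 1], [5, 3, 2, 0]
print(euclidean(x, y), manhattan(x, y), max_of_dimensions(x, y), minkowski(x, y, 3))
```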
Other Distance Measures
Web pages: points in a multidimensional space where each dimension corresponds to one word. Idea: base the distance D(x, y) on the dot product of the vectors corresponding to x and y. The length of a vector is the square root of the sum of the squares of the numbers of occurrences of each word. E.g., with 4 words of interest: x = [2, 0, 3, 1], y = [5, 3, 2, 0], so x·y = 16, the length of x = √14, and the length of y = √38.
Other Distance Measures
D(x, y) = 1 − x·y / (|x|·|y|). For the example above, D(x, y) = 1 − 16 / (√14·√38) ≈ 0.31. Let us assume that x = [a1, a2, a3, a4] and y = [2a1, 2a2, 2a3, 2a4] (document y is a copy of x with every word count doubled); then D(x, y) = 0.
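A hedged sketch of this cosine-based document distance, reusing the word-count vectors from the previous slide; the function name cosine_distance is illustrative.

```python
import math

def cosine_distance(x, y):
    # D(x, y) = 1 - (x . y) / (|x| * |y|)
    dot = sum(xi * yi for xi, yi in zip(x, y))
    length_x = math.sqrt(sum(xi * xi for xi in x))
    length_y = math.sqrt(sum(yi * yi for yi in y))
    return 1 - dot / (length_x * length_y)

print(cosine_distance([2, 0, 3, 1], [5, 3, 2, 0]))   # about 0.31
print(cosine_distance([1, 2, 3, 4], [2, 4, 6, 8]))   # 0.0 (up to rounding): y is a scaled copy of x
```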
Other Distance Measures
DNA sequences, Web access patterns: character sequences may be similar even though they have different lengths and different characters at the same positions, e.g. x = abcde, y = bcdxye.
Distance measure: D(x, y) = |x| + |y| − 2|LCS(x, y)|, where LCS stands for the longest common subsequence. Here LCS(x, y) = bcde, so D(x, y) = 5 + 6 − 2·4 = 3.
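A small Python sketch of this measure, assuming a standard dynamic-programming LCS; the helper names lcs_length and lcs_distance are illustrative.

```python
def lcs_length(x, y):
    # classic dynamic-programming longest common subsequence
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcs_distance(x, y):
    # D(x, y) = |x| + |y| - 2|LCS(x, y)|
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(lcs_distance("abcde", "bcdxye"))   # LCS = "bcde", so D = 5 + 6 - 8 = 3
```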
Binary Variables
How to compute the similarity (or dissimilarity) between objects described by binary variables?
Approach: summarize the binary data for two objects i and j in a 2×2 contingency table:
q – the number of variables that equal 1 for both objects
r – ... equal 1 for i and 0 for j
s – ... equal 0 for i and 1 for j
t – ... equal 0 for both objects
p = q + r + s + t – the total number of variables
Binary Variables
Symmetric binary variable: a binary variable is symmetric if both of its states are equally valuable and carry the same weight (e.g. gender). The dissimilarity between objects i and j is defined as: d(i, j) = (r + s) / (q + r + s + t)
Asymmetric binary variable: a binary variable is asymmetric if the outcomes of the states are not equally important (e.g. the positive and negative outcomes of a disease test). The dissimilarity between objects i and j is defined as: d(i, j) = (r + s) / (q + r + s)
Binary Variables Consider the following table containing patient records:
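The table itself is not reproduced in this text, so as a hedged illustration the patient vectors below are hypothetical, with 1 coding a positive/yes value and 0 a negative/no value; the function implements the two dissimilarity formulas from the previous slide.

```python
def binary_dissimilarity(i, j, symmetric=True):
    # i and j are equal-length 0/1 vectors; q, r, s, t as defined on the previous slide
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    if symmetric:
        return (r + s) / (q + r + s + t)   # symmetric binary dissimilarity
    return (r + s) / (q + r + s)           # asymmetric: 0-0 matches are ignored

# Hypothetical patient vectors (not the slide's actual table)
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
print(binary_dissimilarity(jack, mary, symmetric=False))   # (0 + 1) / (2 + 0 + 1) = 0.33...
```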
Categorical Variables
A categorical variable is a generalization of the binary variable: it can take on more than two states (e.g. income: high, medium, low). The dissimilarity (similarity) between objects i and j described by categorical variables can be computed using the simple matching approach: d(i, j) = (p − m) / p, where p is the total number of variables, m is the number of matches, and p − m is the number of variables for which i and j are in different states.
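A minimal sketch of the simple matching approach; the example objects and their attribute values are hypothetical.

```python
def simple_matching_dissimilarity(i, j):
    # d(i, j) = (p - m) / p, where m counts the variables on which i and j match
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

# Hypothetical objects described by categorical variables (income, color, status)
print(simple_matching_dissimilarity(["high", "blue", "single"],
                                    ["low",  "blue", "married"]))   # 2/3
```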
Clustering Methods (1)
Partitioning methods: given k, the number of clusters to construct, a partitioning method creates an initial partitioning. Then, it uses an iterative reallocation technique that attempts to improve the partitioning by moving objects from one group to another. To achieve global optimality, partitioning-based clustering would require the exhaustive enumeration of all possible partitions. Two popular heuristics: the k-means algorithm and the k-medoids algorithm.
Clustering Methods (2) Hierarchical methods:
divisive approach (top-down): starts with all objects in the same cluster; then, in each successive iteration step, a cluster is split up into smaller clusters.
agglomerative approach (bottom-up): starts with each object forming a separate cluster; it successively merges the objects or groups of objects until a termination condition holds.
Clustering Methods (3)
Density-based methods: a given cluster keeps growing as long as the density of objects in its "neighborhood" exceeds some threshold; i.e. for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points (DBSCAN, OPTICS).
Grid-based methods: the object space is divided into a finite number of cells that form a grid structure. All of the clustering operations are performed on the grid structure (STING, CLIQUE).
Model-based methods: these methods hypothesize a model for each cluster and find the best fit of the data to the given model.
Partitioning Methods The k-means algorithm:
input: the number of clusters k and a database containing n objects
output: a set of k clusters that minimizes the squared-error criterion
arbitrarily choose k objects as the initial cluster centers;
repeat
  reassign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
  update the cluster means, i.e. calculate the mean value of the objects for each cluster;
until no change;
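A minimal Python sketch of this algorithm, assuming Euclidean distance between points; the function names kmeans and mean_point are illustrative.

```python
import math, random

def mean_point(points):
    # component-wise mean of a list of points
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def kmeans(points, k, max_iter=100):
    # arbitrarily choose k objects as the initial cluster centers
    centers = random.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # reassign each object to the cluster with the nearest mean
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:   # until no change
            break
        assignment = new_assignment
        # update the cluster means
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = mean_point(members)
    return centers, assignment
```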
The k-means algorithm - example
Consider 5 points in 2-dimensional space (k=2): 1(1,4), 2(4,2), 3(1,3), 4(3,1), 5(3,4). Suppose we assign the points 1, 2, 3, 4, and 5 in that order.
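Using the kmeans sketch above on these five points; note that the sketch reassigns all points in each pass rather than processing them one at a time as in the slide's walkthrough, but it converges to the same kind of partition.

```python
points = [[1, 4], [4, 2], [1, 3], [3, 1], [3, 4]]
centers, assignment = kmeans(points, k=2)
print(centers)      # e.g. roughly [1.67, 3.67] and [3.5, 1.5] on a typical run
print(assignment)   # cluster index of points 1..5, e.g. [0, 1, 0, 1, 0]
```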
The k-means algorithm
The algorithm attempts to determine the k partitions that minimize the squared-error function. It works well when the clusters are compact clouds that are rather well separated from one another. The algorithm cannot be applied to data with categorical attributes. It is not suitable for discovering clusters with nonconvex shapes or clusters of very different sizes. It is very sensitive to noise and outlier points, since an object with an extremely large value may substantially distort the distribution of data.
The k-medoids algorithm
Instead of taking the mean value of the objects in a cluster as a reference point, the medoid can be used, which is the most centrally located object in a cluster. The basic strategy: find k clusters in n objects by first arbitrarily choosing a representative object (the medoid) for each cluster. Each remaining object is clustered with the medoid to which it is the most similar. The strategy then iteratively replaces one of the medoids with one of the non-medoids as long as the quality of the resulting clustering improves. The quality is estimated using a cost function that measures the average dissimilarity between an object and the medoid of its cluster.
The k-medoids algorithm
To determine whether a non-medoid object orandom is a good replacement for a current medoid oj, the following four cases are examined for each of the non-medoid objects p:
case 1: p currently belongs to medoid oj. If oj is replaced by orandom as a medoid and p is closest to one of the other medoids oi, i ≠ j, then p is reassigned to oi.
case 2: p currently belongs to medoid oj. If oj is replaced by orandom as a medoid and p is closest to orandom, then p is reassigned to orandom.
case 3: p currently belongs to medoid oi, i ≠ j. If oj is replaced by orandom as a medoid and p is still closest to oi, then the assignment does not change.
case 4: p currently belongs to medoid oi, i ≠ j. If oj is replaced by orandom as a medoid and p is closest to orandom, then p is reassigned to orandom.
PAM
PAM (Partitioning Around Medoids) attempts to determine k partitions for n objects. After an initial random selection of k medoids, the algorithm repeatedly tries to make a better choice of medoids. All of the possible pairs of objects are analyzed, where one object in each pair is considered a medoid and the other is not. The quality of the resulting clustering is calculated for each such combination. The set of best objects for each cluster in one iteration forms the medoids for the next iteration. For large values of n and k, such computation becomes very costly.
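A hedged sketch of the PAM swap loop, assuming Euclidean distance and points given as coordinate lists; the function names pam and total_cost are illustrative. It evaluates every (medoid, non-medoid) swap per pass, which is why the method becomes costly for large n and k.

```python
import math, random
from itertools import product

def total_cost(points, medoids):
    # sum of distances from each object to its closest medoid
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = random.sample(points, k)        # initial random selection of k medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        # analyze all (medoid, non-medoid) pairs and try the swap
        for m, o in product(list(medoids), points):
            if o in medoids:
                continue
            candidate = [o if x == m else x for x in medoids]
            cost = total_cost(points, candidate)
            if cost < best:                   # keep the better choice of medoids
                medoids, best = candidate, cost
                improved = True
    return medoids
```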
CLARA Sampling-based method – CLARA (Clustering Large Applications)
The idea is as follows: instead of taking the whole set of data into consideration, a small portion of the actual data is chosen as a representative of the data. Medoids are then chosen from this sample using PAM. If the sample is selected in a fairly random manner, it should closely represent the original data set.
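A sketch of the CLARA idea, reusing the pam and total_cost helpers from the PAM sketch above; the sample size and number of samples are illustrative parameters, not values prescribed by the slides.

```python
import random

def clara(points, k, sample_size=40, n_samples=5):
    # draw several random samples, run PAM on each, keep the best medoids
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = random.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k)              # PAM on the sample only
        cost = total_cost(points, medoids)    # but evaluate on the whole data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids
```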
Hierarchical methods
A hierarchical clustering method works by grouping data objects into a tree of clusters.
divisive approach (top-down): starts with all objects in the same cluster; then, in each successive iteration step, a cluster is split up into smaller clusters.
agglomerative approach (bottom-up): starts with each object forming a separate cluster; it successively merges the objects or groups of objects until a termination condition holds.
Hierarchical methods In either agglomerative or divisive hierarchical clustering, the user can specify the desired number of clusters as a termination condition. Four widely used measures for distance between clusters are as follows, where |p – p’| is the distance between two objects or points p and p’, mi is the mean for cluster Ci, and ni is the number of objects in Ci
Measures for Distance
minimum distance: dmin(Ci, Cj) = min over p in Ci, p' in Cj of |p − p'|
maximum distance: dmax(Ci, Cj) = max over p in Ci, p' in Cj of |p − p'|
mean distance: dmean(Ci, Cj) = |mi − mj|
average distance: davg(Ci, Cj) = (1 / (ni·nj)) · Σ over p in Ci, p' in Cj of |p − p'|
where mi is the mean for cluster Ci and ni is the number of points in Ci.
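A small Python sketch of the four inter-cluster distance measures, assuming clusters are given as lists of coordinate points; the function names d_min, d_max, d_mean, and d_avg are illustrative.

```python
import math

def d_min(Ci, Cj):
    # minimum distance: closest pair of points across the two clusters
    return min(math.dist(p, q) for p in Ci for q in Cj)

def d_max(Ci, Cj):
    # maximum distance: farthest pair of points across the two clusters
    return max(math.dist(p, q) for p in Ci for q in Cj)

def d_mean(Ci, Cj):
    # mean distance: distance between the two cluster means
    mi = [sum(col) / len(Ci) for col in zip(*Ci)]
    mj = [sum(col) / len(Cj) for col in zip(*Cj)]
    return math.dist(mi, mj)

def d_avg(Ci, Cj):
    # average distance over all cross-cluster pairs
    return sum(math.dist(p, q) for p in Ci for q in Cj) / (len(Ci) * len(Cj))
```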
Mining Time-Series
Mining Time-Series
A time-series database consists of sequences of values or events changing over time. The values are typically measured at equal time intervals.
Applications:
Find customers with similar patterns of electricity consumption
Identify companies with similar patterns of growth
Find stocks with similar price movements
Determine products with similar selling patterns
How can we study time-series data?
Time-Series Data Mining: Issues Studied
Similarity search / subsequence analysis: subsequence matching (normalization + matching), template specification (shape and macro specification)
Trend and deviation analysis: find trends (data evolution regularity) and deviations; regression analysis, visualization techniques
Sequential pattern analysis: sequential association rules
Periodicity analysis: full vs. partial periodicity, cyclic association rules
Analysis of Similar Time Sequences
Given: a database consisting of time-series sequences.
Two categories of similarity queries: whole matching vs. subsequence matching.
Typical query types: find a sequence that is similar to the query sequence; find all pairs of similar sequences.
Typical applications: financial markets, market basket data analysis, scientific databases, medical diagnosis.
Time-Series Similarities
Given a database of time-series, group "similar" time-series. [Figure: two value-over-time curves, Investor Fund A and Investor Fund B, moving in a similar way]
Categories of Similarity Queries
Whole pattern matching: the sequences to be compared have the same length. Subsequence pattern matching: the query sequence is shorter; find a subsequence of the long sequence that best matches the query sequence.
Basic Approach Extract k features from every sequence
The sequence becomes a point in k-dimensional space. Apply a multidimensional index to store and search the points.
Problems with transformation
The feature extraction method should ensure that a few features are sufficient to differentiate between objects. How do we extract features, and how do we guarantee that we do not miss any qualifying object? Objects should be mapped to points in k-dimensional space such that the distance (e.g. Euclidean distance) in that space is less than or equal to the real distance between the two objects.
Techniques Take Euclidean distance as the similarity measure
Two popular data-independent transformations: the Discrete Fourier Transform (DFT) and the Discrete Wavelet Transform (DWT).
Use the Discrete Fourier Transform (DFT) coefficients of each sequence in the database. Build a multidimensional index using the first fc coefficients; each sequence is now a point in a 2·fc-dimensional space. Use R*-trees as the indexing structure.
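A hedged sketch of the feature-extraction step using NumPy's FFT; the value of fc and the sample sequences are illustrative, and the orthonormal ("ortho") transform is used so that Euclidean distances carry over to the coefficient space.

```python
import numpy as np

def dft_features(sequence, fc=3):
    # keep the first fc DFT coefficients (orthonormal transform, so distances are preserved);
    # real and imaginary parts together give a point in a 2*fc-dimensional space
    coeffs = np.fft.fft(np.asarray(sequence, dtype=float), norm="ortho")[:fc]
    return np.concatenate([coeffs.real, coeffs.imag])

s1 = [1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0, 2.0]
s2 = [1.1, 2.1, 2.9, 2.0, 0.9, 0.1, 1.0, 2.2]
f1, f2 = dft_features(s1), dft_features(s2)
print(np.linalg.norm(f1 - f2))   # distance in the reduced feature space (a lower bound)
```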
Discrete Fourier Transform (DFT)
Fourier series: any time sequence can be represented by an infinite sum of sinusoids.
The DFT is a distance-preserving linear transformation (by Parseval's Theorem).
If we keep only the first few DFT coefficients, we can compute a lower bound on the actual distance. However, we need post-processing to compute the actual distance and discard false matches.
Parseval's Theorem
If X = DFT(x) is the Discrete Fourier Transform of the sequence x, then Σt |x(t)|² = Σf |X(f)|².
Since the DFT is a linear transformation, Parseval's theorem also gives Σt |x(t) − y(t)|² = Σf |X(f) − Y(f)|².
Thus, the distance between two signals x and y in the time domain is the same as their Euclidean distance in the frequency domain: D(x, y) = D(X, Y).
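A short numerical check of this property with NumPy; the sequences are made up, and the orthonormal DFT is used so the equality holds without scaling factors.

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 2.0, 1.0, 0.0])
y = np.array([1.2, 2.8, 2.1, 4.9, 4.2, 1.9, 1.1, 0.3])

# With the orthonormal DFT, Parseval's theorem makes the Euclidean distance
# identical in the time domain and in the frequency domain.
X = np.fft.fft(x, norm="ortho")
Y = np.fft.fft(y, norm="ortho")
print(np.linalg.norm(x - y))             # time-domain distance
print(np.linalg.norm(X - Y))             # frequency-domain distance (same value)

# Truncating to the first few coefficients can only decrease the sum of squares,
# so the truncated distance is a lower bound and no qualifying sequence is missed.
fc = 3
print(np.linalg.norm(X[:fc] - Y[:fc]))   # <= the true distance
```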