Clustering different types of data
Pasi Fränti
Data types
- Numeric
- Binary
- Categorical
- Text
- Time series
Part I: Numeric data
Distance measures (measurement scales)
- Nominal: operations ==; example variable: major subject (computer science, mathematics, physics)
- Ordinal: operations ==, <, >; example: degree (bachelor, master, licentiate, doctor)
- Interval: operations ==, <, >, −; example: temperature (10 °C, 20 °C, 10 °F)
- Ratio: operations ==, <, >, −, /; example: weight (0 kg, 10 kg, 20 kg)
Definition of distance metric
A distance function d is a metric if the following conditions hold for all data points x, y, z:
- Non-negativity: d(x, y) ≥ 0
- Identity: d(x, x) = 0
- Symmetry: d(x, y) = d(y, x)
- Triangle inequality: d(x, y) ≤ d(x, z) + d(z, y)
Common distance metrics
For p-dimensional vectors $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $x_j = (x_{j1}, x_{j2}, \ldots, x_{jp})$:
- Minkowski distance: $d_{ij} = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^q \right)^{1/q}$
- Euclidean distance (q = 2): $d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$
- Manhattan distance (q = 1): $d_{ij} = \sum_{k=1}^{p} |x_{ik} - x_{jk}|$
Distance metrics example
2-D example: x1 = (2, 8), x2 = (6, 3); the attribute differences are 4 and 5.
- Euclidean distance: $\sqrt{4^2 + 5^2} = \sqrt{41} \approx 6.40$
- Manhattan distance: $4 + 5 = 9$
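For illustration, a minimal Python sketch of these metrics (not from the original slides), using the slide's example points:

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

x1, x2 = (2, 8), (6, 3)
print(minkowski(x1, x2, 2))  # Euclidean: sqrt(4^2 + 5^2) = sqrt(41) ~ 6.40
print(minkowski(x1, x2, 1))  # Manhattan: 4 + 5 = 9
```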
Chebyshev distance
In the limit q → ∞, the Minkowski distance equals the maximum attribute difference: $d_{ij} = \max_{k} |x_{ik} - x_{jk}|$. Useful when the worst case must be avoided.
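A corresponding sketch of the Chebyshev distance, reusing the same example points:

```python
def chebyshev(x, y):
    """Chebyshev (L-infinity) distance: the largest attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

print(chebyshev((2, 8), (6, 3)))  # max(4, 5) = 5
```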
Hierarchical clustering: cost functions
Three classical linkage criteria:
- Single linkage
- Complete linkage
- Average linkage
Single link
The smallest distance between vectors in clusters i and j:
$d_{SL}(i, j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$
Complete link
The largest distance between vectors in clusters i and j:
$d_{CL}(i, j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$
Average link
The average distance between vectors in clusters i and j:
$d_{AL}(i, j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$
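A small illustrative sketch of the three linkage criteria; the clusters A and B below are made-up toy data, not from the slides:

```python
import math
from itertools import product

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_link(A, B, d=euclidean):
    """Smallest pairwise distance between the clusters."""
    return min(d(a, b) for a, b in product(A, B))

def complete_link(A, B, d=euclidean):
    """Largest pairwise distance between the clusters."""
    return max(d(a, b) for a, b in product(A, B))

def average_link(A, B, d=euclidean):
    """Mean of all pairwise distances between the clusters."""
    return sum(d(a, b) for a, b in product(A, B)) / (len(A) * len(B))

A = [(1.0, 0.0), (1.1, 0.0)]
B = [(1.4, 0.0), (1.5, 0.0)]
print(single_link(A, B), complete_link(A, B), average_link(A, B))  # 0.3 0.5 0.4
```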
Cost function example [Theodoridis, Koutroumbas, 2006]
[Figure: single-link and complete-link dendrograms for a one-dimensional data set x1, ..., x7 with values between 1 and 1.5.]
Part II: Binary data
Hamming distance (binary and categorical data)
The number of attribute positions in which two vectors differ. Examples on the 3-bit binary cube: 100 → 011 has distance 3 (all three bits differ) and 010 → 111 has distance 2; the distance between the words "toned" and "roses" is 3.
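A minimal sketch of the Hamming distance, reproducing the slide's examples:

```python
def hamming(x, y):
    """Number of attribute positions in which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))

print(hamming("100", "011"))      # 3
print(hamming("010", "111"))      # 2
print(hamming("toned", "roses"))  # 3
```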
Hard thresholding of centroid
Rounding each attribute of the centroid (0.40, 0.60, 0.75, 0.20, 0.45, 0.25) at the 0.5 threshold gives the binary centroid (0, 1, 1, 0, 0, 0).
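A one-liner illustrating the hard thresholding; the 0.5 rounding threshold is the conventional choice and is my assumption, not stated on the slide:

```python
centroid = (0.40, 0.60, 0.75, 0.20, 0.45, 0.25)
hard = tuple(int(c >= 0.5) for c in centroid)  # threshold assumed to be 0.5
print(hard)  # (0, 1, 1, 0, 0, 0)
```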
Hard and soft centroids
[Figure: comparison on the Bridge data set (binary version).]
Distance and distortion
A general distance function measures the dissimilarity between a data vector and its cluster centroid; the distortion function sums these distances over the whole data set.
Distortion for binary data
Cost of a single attribute: the number of zeroes is $q_{jk}$, the number of ones is $r_{jk}$, and $c_{jk}$ is the current centroid value for attribute k of group j.
Optimal centroid position
The optimal centroid position depends on the metric: given the parameter of the distance function, the optimal position follows from the attribute counts $q_{jk}$ and $r_{jk}$.
Example of centroid location
[Figures: examples of centroid location.]
Part III: Categorical data

Categorical clustering
Example data set with three attributes:

Movie                Director    Actor     Genre
t1 (Godfather II)    Coppola     De Niro   Crime
t2 (Good Fellas)     Scorsese    De Niro   Crime
t3 (Vertigo)         Hitchcock   Stewart   Thriller
t4 (N by NW)         Hitchcock   Grant     Thriller
t5 (Bishop's Wife)   Koster      Grant     Comedy
t6 (Harvey)          Koster      Stewart   Comedy
Categorical clustering
Sample 2-D data with two attributes, color and shape. [Figure: three alternative clustering models A, B, and C.]
Methods
- K-means variants: k-modes, k-medoids
- Histogram-based methods: k-distributions, k-histograms, k-populations, k-representatives
Entropy-based cost functions
Category utility measures how much a clustering improves the prediction of attribute values. The entropy of the data set, $H(X) = -\sum_i p_i \log p_i$, and the entropies of the clusters relative to the data serve as cost functions.
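Since the slide's formulas were not preserved, here is a minimal sketch of Shannon entropy over attribute value frequencies, using the genre attribute of the movie example above:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())

genres = ["Crime", "Crime", "Thriller", "Thriller", "Comedy", "Comedy"]
print(entropy(genres))      # log2(3) ~ 1.585 bits: three equally likely genres
print(entropy(genres[:2]))  # 0.0 bits: a pure cluster has zero entropy
```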
Iterative algorithms
K-modes clustering: distance function
The distance between a vector and the cluster mode is the number of mismatching attributes. Example: vector (A, F, I) vs. mode (A, D, G) gives distance 2 (+1 for F ≠ D, +1 for I ≠ G).
K-modes clustering: prototype of cluster
The prototype is the mode: the most frequent value of each attribute. Example: the vectors (A, D, G), (B, D, H), (A, F, I) have mode (A, D, ·); the first two attributes have clear majorities, while the third is a three-way tie.
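A small sketch of the column-wise mode computation on the slide's vectors:

```python
from collections import Counter

def mode_prototype(vectors):
    """Column-wise mode: most frequent value per attribute (ties broken arbitrarily)."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*vectors))

vectors = [("A", "D", "G"), ("B", "D", "H"), ("A", "F", "I")]
print(mode_prototype(vectors))  # ('A', 'D', 'G') -- third attribute is a three-way tie
```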
K-medoids clustering: prototype of cluster
The prototype is the medoid: the vector with minimal total distance to the others. Example: for the vectors (A, C, E), (B, C, F), (B, D, G), the total Hamming distances are 2 + 3 = 5, 2 + 2 = 4, and 3 + 2 = 5, so the medoid is (B, C, F).
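A corresponding sketch of medoid selection under Hamming distance, reproducing the slide's example:

```python
def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def medoid(vectors, d=hamming):
    """The vector with the minimal total distance to all others in the cluster."""
    return min(vectors, key=lambda v: sum(d(v, u) for u in vectors))

cluster = [("A", "C", "E"), ("B", "C", "F"), ("B", "D", "G")]
print(medoid(cluster))  # ('B', 'C', 'F'): total distance 2 + 2 = 4
```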
K-medoids: example and calculation
[Figures: worked k-medoids example and calculation.]
K-histograms
The prototype stores the frequency distribution of each attribute rather than a single value. For the second attribute of the previous example (values D, D, F), the histogram is D: 2/3, F: 1/3.
K-distributions
Cost function with an ε addition.
Example of cluster allocation: change of entropy. [Figure.]
Problem of non-convergence. [Figure.]
Results with Census dataset
Literature
- Modified k-modes + k-histograms: M. Ng, M.J. Li, J.Z. Huang and Z. He, "On the impact of dissimilarity measure in k-modes clustering algorithm", IEEE Trans. on Pattern Analysis and Machine Intelligence, 29(3), March 2007.
- ACE: K. Chen and L. Liu, "The 'best k' for entropy-based categorical data clustering", Int. Conf. on Scientific and Statistical Database Management (SSDBM 2005), Berkeley, USA, 2005.
- ROCK: S. Guha, R. Rastogi and K. Shim, "ROCK: a robust clustering algorithm for categorical attributes", Information Systems, 25(5), 2000.
- K-medoids: L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York, 1990.
- K-modes: Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, 2(3), 1998.
- K-distributions: Z. Cai, D. Wang and L. Jiang, "K-distributions: a new algorithm for clustering categorical data", Int. Conf. on Intelligent Computing (ICIC 2007), Qingdao, China, 2007.
- K-histograms: Z. He, X. Xu, S. Deng and B. Dong, "K-histograms: an efficient clustering algorithm for categorical dataset", CoRR.
Part IV: Text data
Applications of text clustering
- Query relaxation
- Spell-checking
- Automatic categorization
- Document clustering
Query relaxation
Current solution: matching suffixes from the database. Alternative solution: semantic clustering.
Spell-checking
Example: the word kahvila (café) with one correct and two incorrect spellings.
Automatic categorization
Category assigned by clustering.
Document clustering
Motivation: group related documents based on their content; no predefined training set (taxonomy); the taxonomy is generated at runtime.
Clustering process:
1. Data preprocessing: tokenize, remove stop words, stem, extract features, lexical analysis
2. Define the cost function
3. Perform the clustering
Text clustering
String similarity is the basis for clustering text data; a measure is required to compute the similarity between two strings.
String similarity
- Semantic: car and auto; automobile and auto.
- Syntactic: отель and готель (Russian and Ukrainian for "hotel"); sauna and sana.
Semantic similarity
Lexical database: WordNet. English words are organized into sets of synonyms (synsets) linked by relations such as generalization. [Figure: fragment of the WordNet hierarchy from object down to car/auto and truck.]
Similarity using WordNet [Wu and Palmer, 2004]
Input: word 1: wolf, word 2: hunting dog. Output: similarity value = 0.89.
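A hedged sketch using NLTK's WordNet interface; the synset names wolf.n.01 and hunting_dog.n.01 are my guesses at the intended word senses, and the WordNet corpus must be downloaded first:

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

wolf = wn.synset('wolf.n.01')                # assumed sense of "wolf"
hunting_dog = wn.synset('hunting_dog.n.01')  # assumed sense of "hunting dog"
# Wu-Palmer similarity: 2 * depth(LCS) / (depth(s1) + depth(s2))
print(wolf.wup_similarity(hunting_dog))  # ~0.89 as on the slide (depends on WordNet version)
```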
Hierarchical clustering by WordNet
[Figure.]
Syntactic similarity
Operates on the words and their characters. Can be divided into three components:
- Character-level similarity measures
- Matching techniques
- Token similarity
Syntactic similarity workflow
[Figure: workflow diagram.]
Character-level measures
Treat strings as sequences of characters and determine similarity in one of three ways:
- Exact match
- Transformation
- Longest common substring
[Figure: example location names such as "The Point", "Tigne Point", and "Tigne Point mall".]
Exact match
Binary result: 1 if the strings are identical, 0 otherwise. Example: "Machine Learning" vs. "Machine Learning" → 1 (match); "Machine Learning" vs. "Machine Learned" → 0 (mismatch).
Transformation
- Edit distance: the number of single edit operations (insertion, deletion, substitution) needed to transform one string into another.
- Hamming: allows only substitutions; the strings must be of equal length.
- Jaro/Winkler: based on the number of matching and transposed characters (a/u, u/a).
Levenshtein edit distance: example
Input: string 1: kitten, string 2: sitting. Output: 3.
1. Substitute k with s: sitten
2. Substitute e with i: sittin
3. Insert g: sitting
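A compact dynamic-programming sketch of the Levenshtein distance, verified on the slide's example:

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution (free on match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```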
Longest common substring
Finds the longest contiguous sequence of characters that occurs in both strings. Example 1: ABABC vs. BABCA; example 2: AXAXA vs. ABCBA (compare the LCS length with the edit distance, ED).
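A minimal dynamic-programming sketch of the longest-common-substring length; the printed values are computed by this sketch, not taken from the slide:

```python
def longest_common_substring(s, t):
    """Length of the longest contiguous substring shared by s and t."""
    best = 0
    prev = [0] * (len(t) + 1)
    for cs in s:
        curr = [0]
        for j, ct in enumerate(t, 1):
            # extend the common substring ending at (i-1, j-1), or reset to 0
            curr.append(prev[j - 1] + 1 if cs == ct else 0)
        best = max(best, max(curr))
        prev = curr
    return best

print(longest_common_substring("ABABC", "BABCA"))  # 4 ("BABC")
print(longest_common_substring("AXAXA", "ABCBA"))  # 1 ("A")
```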
String segmentation
- Q-grams: divide the string into substrings of length q; e.g., the 2-grams of "bingon" are bi, in, ng, go, on.
- Tokenization: break a string into words and symbols called tokens, using whitespace, line breaks, and punctuation characters; e.g., "The club at the Ivy".
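A short sketch of both segmentation schemes:

```python
def qgrams(s, q=2):
    """All contiguous substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

print(qgrams("bingon"))               # ['bi', 'in', 'ng', 'go', 'on']
print("The club at the Ivy".split())  # whitespace tokenization into tokens
```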
Matching techniques
[Figures: examples of matching techniques.]
Token similarity
Two alternatives to compare tokens:
- Exact matching: 1 if the tokens match, 0 otherwise.
- Approximate matching: compute the similarity between tokens using a character-level measure.
Approximate matching: example [Monge and Elkan, 1996]
Input: string 1: "gray color", string 2: "the grey colour". Output: similarity value 0.85.
Pairwise similarities using edit distance (Smith-Waterman-Gotoh):

         the    grey   colour   maximum
gray     0.20   0.90   0.30     0.90
color    –      –      0.80     0.80

Averaging the row maxima gives (0.90 + 0.80) / 2 = 0.85.
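A minimal sketch of this token-level matching; difflib's SequenceMatcher ratio is used here as a stand-in for the slide's Smith-Waterman-Gotoh similarity, so the resulting value differs slightly:

```python
from difflib import SequenceMatcher

def char_sim(a, b):
    """Stand-in character-level similarity (the slide uses Smith-Waterman-Gotoh)."""
    return SequenceMatcher(None, a, b).ratio()

def monge_elkan(s1, s2):
    """Average, over tokens of s1, of the best similarity to any token of s2."""
    t1, t2 = s1.split(), s2.split()
    return sum(max(char_sim(a, b) for b in t2) for a in t1) / len(t1)

print(monge_elkan("gray color", "the grey colour"))  # ~0.83 with this stand-in (slide: 0.85)
```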
Similarities for sample data (– = value not recovered from the original table):

Compared strings                                                     ED     2-gram   3-gram   4-gram   Cosine
Pizza Express Café vs. Pizza Express                                 72%    79%      74%      70%      82%
Lounasravintola Pinja Ky – ravintoloita vs. Lounasravintola Pinja    54%    68%      67%      65%      63%
Kioski Piirakkapaja vs. Kioski Marttakahvio                          47%    45%      33%      32%      50%
Kauppa Kulta Keidas vs. Kauppa Kulta Nalle                           67%    60%      –        –        –
Ravintola Beer Stop Pub vs. Baari, Beer Stop R-kylä                  39%    42%      36%      31%      –
Ravintola Foxie s Bar vs. Foxie Karsikko                             25%    15%      12%      24%      –
Play baari vs. Ravintola Bar Play – Ravintoloita                     21%    17%      8%       –        –

The last two pairs refer to different establishments.
Part V: Time series
Clustering of time series
[Figure.]
Dynamic time warping (DTW)
Aligns two time series by minimizing the distance between the aligned observations. Solved by dynamic programming.
Example of DTW
[Figure.]
Prototype of a cluster
The sequence c that minimizes E(Sj, c) is called a Steiner sequence. A good approximation to the Steiner problem is to use the medoid of the cluster (the discrete median): the time series within the cluster that minimizes E(Sj, c).
Calculating the prototype
Can be solved by dynamic programming, but the complexity is exponential in the number of time series in the cluster.
Averaging heuristic
1. Compute the medoid sequence of the cluster.
2. Compute warping paths from the medoid to all other time series in the cluster.
3. The new prototype is the average sequence over the warping paths.
Local search heuristics
Example of the three methods
[Figure: prototypes with E(S) = 159, E(S) = 138, and E(S) = 118.]
Local search provides the best fit in terms of the Steiner cost function, but it cannot modify the sequence length during the iterations. On data sets with sequences of varying lengths it may still provide a better fit, yet less sensible prototypes.
Experiments
Part VI: Other clustering problems
Clustering of GPS trajectories
Density clusters
[Figure: density clusters on a map, labeled walking street, swim hall, market place, science park, homes of users, and shop.]
Image segmentation
Objects of different colors. [Figure.]
Literature
- S. Theodoridis and K. Koutroumbas, Pattern Recognition, 2nd edition, Academic Press, 2006.
- P. Fränti and T. Kaukoranta, "Binary vector quantizer design using soft centroids", Signal Processing: Image Communication, 14(9), 677-681, 1999.
- I. Kärkkäinen and P. Fränti, "Variable metric for binary vector quantization", IEEE Int. Conf. on Image Processing (ICIP'04), vol. 3, Singapore, October 2004.
- M. Pucher, "Performance evaluation of WordNet-based semantic relatedness measures for word prediction in conversational speech", 2004.
- A.E. Monge and C. Elkan, "The field matching problem: algorithms and applications", Int. Conf. on Knowledge Discovery and Data Mining, 1996.