Clustering: cost functions
Clustering Methods: Part 4
Pasi Fränti
Speech and Image Processing Unit, School of Computing, University of Eastern Finland
29.4.2014
Data types
Numeric, binary, categorical, text, time series
Part I: Numeric data
Distance measures
Definition of distance metric
A distance function d is a metric if the following conditions are met for all data points x, y, z:
– All distances are non-negative: d(x, y) ≥ 0
– Distance from a point to itself is zero: d(x, x) = 0
– All distances are symmetric: d(x, y) = d(y, x)
– Triangle inequality: d(x, y) ≤ d(x, z) + d(z, y)
Common distance metrics
Minkowski distance between $X_i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $X_j = (x_{j1}, x_{j2}, \dots, x_{jp})$:
$d_{ij} = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{q} \right)^{1/q}$
Euclidean distance: q = 2. Manhattan distance: q = 1.
Distance metrics example
X1 = (2, 8), X2 = (6, 3); the attribute differences are 4 and 5.
Euclidean distance: sqrt(4² + 5²) = sqrt(41) ≈ 6.40
Manhattan distance: 4 + 5 = 9
Chebyshev distance
In the limit q → ∞, the Minkowski distance equals the maximum difference over the attributes: $d_{ij} = \max_{k} |x_{ik} - x_{jk}|$. Useful if the worst case must be avoided.
Example: for X1 = (2, 8) and X2 = (6, 3), the Chebyshev distance is max(4, 5) = 5.
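The whole Minkowski family, including the Chebyshev limit, fits in a few lines of Python. This is an illustrative sketch (the function name is ours; the example points are from the slides):

```python
def minkowski(x, y, q=2.0):
    """Minkowski distance between equal-length vectors x and y.

    q = 1 gives Manhattan, q = 2 gives Euclidean,
    q = float('inf') gives the Chebyshev (maximum) distance.
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if q == float('inf'):
        return max(diffs)                      # limit q -> infinity
    return sum(d ** q for d in diffs) ** (1.0 / q)

# The 2-D example from the slides: X1 = (2, 8), X2 = (6, 3)
x1, x2 = (2, 8), (6, 3)
print(minkowski(x1, x2, 2))              # Euclidean: sqrt(41) ~ 6.40
print(minkowski(x1, x2, 1))              # Manhattan: 9
print(minkowski(x1, x2, float('inf')))   # Chebyshev: 5
```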
Hierarchical clustering cost functions
Single link: the smallest distance between vectors in clusters i and j:
$d_{\min}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$
Complete link: the largest distance between vectors in clusters i and j:
$d_{\max}(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$
Average link: the average distance between vectors in clusters i and j:
$d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$
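A minimal sketch of the three linkage costs, assuming a pairwise distance function d (here Euclidean) and clusters given as lists of points:

```python
from itertools import product
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_link(ci, cj, d=euclidean):
    """Smallest pairwise distance between clusters ci and cj."""
    return min(d(x, y) for x, y in product(ci, cj))

def complete_link(ci, cj, d=euclidean):
    """Largest pairwise distance between clusters ci and cj."""
    return max(d(x, y) for x, y in product(ci, cj))

def average_link(ci, cj, d=euclidean):
    """Average pairwise distance between clusters ci and cj."""
    return sum(d(x, y) for x, y in product(ci, cj)) / (len(ci) * len(cj))
```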
Single Link
Complete Link
Average Link
Cost function example [Theodoridis, Koutroumbas, 2006]
(Figure: a seven-point data set x1…x7 and the resulting single-link and complete-link dendrograms.)
Part II: Binary data
Hamming distance (binary and categorical data)
Number of differing attribute values:
– Distance of (1011101) and (1001001) is 2.
– Distance of (2143896) and (2233796) is 3.
– Distance of (toned) and (roses) is 3.
On the 3-bit binary cube: 100 → 011 has distance 3 (red path); 010 → 111 has distance 2 (blue path).
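The examples above can be verified with a few lines of Python (an illustrative sketch, not code from the original material):

```python
def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(a != b for a, b in zip(x, y))

print(hamming("1011101", "1001001"))  # 2
print(hamming("2143896", "2233796"))  # 3
print(hamming("toned", "roses"))      # 3
```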
Hard thresholding of centroid
Thresholding (0.40, 0.60, 0.75, 0.20, 0.45, 0.25) at 0.5 gives the binary centroid (0, 1, 1, 0, 0, 0).
Hard and soft centroids: Bridge (binary version)
Distance and distortion
General distance function (Minkowski form): $d(x, c) = \left( \sum_{k=1}^{p} |x_k - c_k|^{q} \right)^{1/q}$
Distortion function (total cost over all clusters): $D = \sum_{j=1}^{K} \sum_{x \in C_j} d(x, c_j)$
Distortion for binary data
Cost of a single attribute k in group j: $e_{jk} = q_{jk} \cdot c_{jk}^{\,q} + r_{jk} \cdot (1 - c_{jk})^{\,q}$, where $q_{jk}$ is the number of zeroes, $r_{jk}$ the number of ones, and $c_{jk}$ is the current centroid value for variable k of group j.
Optimal centroid position
The optimal centroid position depends on the metric. Given the metric parameter q > 1, the optimal position is $c_{jk} = \frac{1}{1 + (q_{jk}/r_{jk})^{1/(q-1)}}$; for q = 1 it reduces to the majority value of the attribute.
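A numerical sketch of this closed form, assuming the per-attribute cost from the previous slide; all names here are illustrative:

```python
def attribute_cost(c, zeros, ones, q):
    """Cost of one attribute: zeros * c^q + ones * (1 - c)^q (assumed form)."""
    return zeros * c ** q + ones * (1 - c) ** q

def optimal_centroid(zeros, ones, q):
    """Closed-form minimizer for q > 1; majority value for q = 1."""
    if q == 1:
        return 0.0 if zeros > ones else 1.0
    return 1.0 / (1.0 + (zeros / ones) ** (1.0 / (q - 1)))

zeros, ones, q = 40, 60, 2.0
# Brute-force scan over candidate centroid values agrees with the formula.
best = min((i / 1000 for i in range(1001)),
           key=lambda c: attribute_cost(c, zeros, ones, q))
print(best)                              # ~0.6 by brute force
print(optimal_centroid(zeros, ones, q))  # 0.6 = ones / (zeros + ones) for q = 2
```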
Example of centroid location
Centroid location
Part III: Categorical data
Categorical clustering: three attributes
Categorical clustering: sample 2-D data, color and shape (Model A, Model B, Model C)
Categorical clustering
Methods: k-modes, k-medoids, k-distributions
Histogram-based methods: k-histograms, k-populations, k-representatives
Entropy-based cost functions
Category utility: $CU = \frac{1}{k} \sum_{j} P(C_j) \sum_{i} \sum_{v} \left[ P(A_i = v \mid C_j)^2 - P(A_i = v)^2 \right]$
Entropy of data set: $H(X) = -\sum_{v} p(v) \log p(v)$
Entropies of the clusters relative to the data: $E = \sum_{j=1}^{k} \frac{n_j}{n} H(C_j)$
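A sketch of the entropy terms (standard Shannon entropy over attribute values; the notation is ours, not from the slides):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of categorical values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

data = ["red", "red", "blue", "green", "blue", "red"]
clusters = [["red", "red", "red"], ["blue", "green", "blue"]]

# Cluster entropies weighted by cluster size, relative to the whole data.
n = len(data)
weighted = sum(len(c) / n * entropy(c) for c in clusters)
print(entropy(data), weighted)  # clustering reduces the weighted entropy
```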
Iterative algorithms
K-modes clustering
Distance function: the number of mismatching attributes, $d(x, y) = \sum_{k=1}^{p} \delta(x_k, y_k)$, where δ(a, b) = 0 if a = b and 1 otherwise.
K-modes clustering
Prototype of cluster: the mode, i.e. for each attribute the most frequent value in the cluster.
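A minimal sketch of the two k-modes ingredients named above, the mismatch distance and the per-attribute mode prototype (illustrative names and data):

```python
from collections import Counter

def mismatch_distance(x, y):
    """k-modes distance: number of attributes with different values."""
    return sum(a != b for a, b in zip(x, y))

def mode_prototype(cluster):
    """Per-attribute mode: the most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

cluster = [("red", "circle"), ("red", "square"), ("blue", "circle")]
print(mode_prototype(cluster))  # ('red', 'circle')
```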
K-medoids clustering
Prototype of cluster is the medoid: the vector with minimal total distance to all other vectors in the cluster.
Example with vectors ACE, BCF and BDG (pairwise Hamming distances: ACE–BCF = 2, BCF–BDG = 2, ACE–BDG = 3):
total distance for ACE: 2 + 3 = 5; for BCF: 2 + 2 = 4; for BDG: 3 + 2 = 5 → the medoid is BCF.
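The medoid computation from this example, as a self-contained sketch:

```python
def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))

def medoid(cluster, distance):
    """Vector of the cluster with minimal total distance to all others."""
    return min(cluster,
               key=lambda x: sum(distance(x, y) for y in cluster))

print(medoid(["ACE", "BCF", "BDG"], hamming))  # BCF (total distance 4)
```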
K-medoids example
K-medoids calculation
K-histograms
Prototype of cluster: a histogram of values per attribute, e.g. D: 2/3, F: 1/3.
K-distributions: cost function with ε addition
Example of cluster allocation: change of entropy
Problem of non-convergence
Results with Census dataset
Part IV: Text data
Applications of text clustering
Query relaxation, spell-checking, automatic categorization, document similarity
Query relaxation
Current solution: matching suffixes from the database
Alternative solution: from semantic clustering
Spell-checking
The word kahvila (café), with one correct and two incorrect spellings.
Automatic categorization: category by clustering. (The example categories shown are a Google translation.)
String clustering
The similarity between every string pair is calculated as the basis for determining the clusters, so a similarity measure between two strings is required. Two approaches: approximate string matching and semantic similarity.
Document clustering
Motivation:
– Group related documents based on their content
– No predefined training set (taxonomy)
– Generate a taxonomy at runtime
Clustering process:
– Data preprocessing: remove stop words, stem, feature extraction and lexical analysis
– Define cost function
– Perform clustering
Exact string matching
Given a text string T of length n and a pattern string P of length m, the exact string matching problem is to find all occurrences of P in T.
Example: T = "AGCTTGA", P = "GCT"
Applications:
– Searching keywords in a file
– Search engines (like Google)
– Database searching
Approximate string matching
Determine whether a text string T of length n and a pattern string P of length m match partially.
Consider the string "approximate". Which of these are partial matches?
aproximate, approximately, appropriate, proximate, approx, approximat, apropos, approxximate
– A partial match can be thought of as one that has k differences from the string, where k is some small integer (for instance 1 or 2).
– A difference occurs if string1.charAt(j) != string2.charAt(j), or if string1.charAt(j) does not appear in string2 (or vice versa). The former case is a substitution ("revise") difference; the latter is a delete or insert difference.
– What about two characters that appear out of position, for instance approximate vs. apporximate (a transposition)?
Approximate string matching
Example data: Keanu Reeves, Samuel Jackson, Arnold Schwarzenegger, H. Norman Schwarfkopf, Bernard Schwartz, … matched against the query "Schwarrzenger".
Query errors: limited knowledge about the data, typos, limited input device (cell phone).
Data errors: typos, web data, OCR.
Similarity functions: edit distance, q-grams, cosine.
Edit distance (Levenshtein distance)
Given two strings T and P, the edit distance is the minimum number of substitutions, insertions and deletions that transform T into P.
Time complexity by dynamic programming: O(nm).
Edit distance by dynamic programming (1974)
Recurrence: m[i][j] = min{ m[i-1][j] + 1, m[i][j-1] + 1, m[i-1][j-1] + d(i,j) }, where d(i,j) = 0 if the i-th and j-th characters match, and 1 otherwise.
Example table for "temp" vs. "tmp" (edit distance 1):

        t  m  p
     0  1  2  3
  t  1  0  1  2
  e  2  1  1  2
  m  3  2  1  2
  p  4  3  2  1
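The recurrence translates directly into code. A minimal Python sketch (function name and test are illustrative):

```python
def edit_distance(t, p):
    """Levenshtein distance by dynamic programming, O(nm) time."""
    n, m = len(t), len(p)
    # dp[i][j] = edit distance between t[:i] and p[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if t[i - 1] == p[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[n][m]

print(edit_distance("temp", "tmp"))  # 1
```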
Q-grams
Substrings of fixed length q; for example, the 2-grams of "bingo" are bi, in, ng, go.
Count filtering: if ed(T, P) ≤ k, then the number of q-grams common to T and P is at least (number of q-grams of T) − k·q.
Q-grams example
T = "bingo", P = "going", q = 2, padded with #:
gram1 = {#b, bi, in, ng, go, o#}
gram2 = {#g, go, oi, in, ng, g#}
Combined multiset = {#b, bi, in, ng, go, o#, #g, go, oi, in, ng, g#}
Non-shared grams (1 = unmatched) = {1,1,0,0,0,1,1,0,1,0,0,1}, sum = 6
gram1.length = (T.length + (q−1)·2 + 1) − q = 6
gram2.length = (P.length + (q−1)·2 + 1) − q = 6
L = gram1.length + gram2.length = 12
Similarity = (L − 6) / L = 0.5
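A sketch of the q-gram similarity as computed in this example, assuming padding with q−1 '#' characters per side and multiset counting of common grams:

```python
from collections import Counter

def qgrams(s, q=2, pad="#"):
    """Multiset of q-grams of s, padded with q-1 pad characters per side."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

def qgram_similarity(t, p, q=2):
    g1, g2 = qgrams(t, q), qgrams(p, q)
    total = sum(g1.values()) + sum(g2.values())
    common = sum((g1 & g2).values())  # multiset intersection
    # Equals (L - non-shared) / L from the slide, since non-shared = L - 2*common.
    return 2 * common / total

print(qgram_similarity("bingo", "going"))  # 0.5
```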
Cosine similarity
For two vectors A and B, the angle θ is obtained from the dot product and the magnitudes: cos θ = (A · B) / (||A|| ||B||).
Implementation for token sets: cosine similarity = (common terms) / (sqrt(number of terms in String1) × sqrt(number of terms in String2))
Cosine similarity example
T = "bingo right", P = "going right"
T1 = {bingo, right}, P1 = {going, right}
L1 = unique(T1).length = 2; L2 = unique(P1).length = 2
unique(T1 ∪ P1) = {bingo, right, going}; L3 = 3
Common terms = (L1 + L2) − L3 = 1
Similarity = common terms / (sqrt(L1) × sqrt(L2)) = 1/2 = 0.5
Dice coefficient
Similar to cosine similarity:
Dice coefficient = (2 × common terms) / (number of terms in String1 + number of terms in String2)
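Both token-based measures as sketches, following the implementations described above (names and test strings are illustrative):

```python
import math

def token_overlap(t, p):
    """Unique-token counts and common-term count for two strings."""
    s1, s2 = set(t.split()), set(p.split())
    return len(s1), len(s2), len(s1 & s2)

def cosine_similarity(t, p):
    l1, l2, common = token_overlap(t, p)
    return common / (math.sqrt(l1) * math.sqrt(l2))

def dice_coefficient(t, p):
    l1, l2, common = token_overlap(t, p)
    return 2 * common / (l1 + l2)

print(cosine_similarity("bingo right", "going right"))  # 0.5
print(dice_coefficient("bingo right", "going right"))   # 0.5
```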
Similarities for sample data

Compared strings                                                | Edit distance | Q-gram q=2 | Q-gram q=3 | Q-gram q=4 | Cosine
Pizza Express Café / Pizza Express                              | 72% | 79% | 74% | 70% | 82%
Lounasravintola Pinja Ky – ravintoloita / Lounasravintola Pinja | 54% | 68% | 67% | 65% | 63%
Kioski Piirakkapaja / Kioski Marttakahvio                       | 47% | 45% | 33% | 32% | 50%
Kauppa Kulta Keidas / Kauppa Kulta Nalle                        | 68% | 67% | 63% | 60% | 67%
Ravintola Beer Stop Pub / Baari, Beer Stop R-kylä               | 39% | 42% | 36% | 31% | 50%
Ravintola Foxie s Bar / Foxie Karsikko                          | 31% | 25% | 15% | 12% | 24%
Play baari / Ravintola Bar Play – Ravintoloita                  | 21% | 31% | 17% |  8% | 32%
Thesaurus-based: WordNet
WordNet
– An extensive lexical network for the English language
– Contains over 138,838 words
– Several graphs, one for each part-of-speech
– Synsets (synonym sets), each defining a semantic sense
– Relationship information (antonym, hyponym, meronym, …)
– Downloadable for free (UNIX, Windows)
– Expanding to other languages (Global WordNet Association)
– Funded with over $3 million, mainly by government (translation interest)
– Founder George Miller, National Medal of Science, 1991
(Figure: synonym/antonym links among wet, watery, moist, damp and dry, parched, anhydrous, arid.)
Example of WordNet
object
  artifact
    instrumentality
      conveyance, transport
        vehicle
          wheeled vehicle: bike, bicycle; truck; car, auto (automotive, motor)
    article
      ware
        table ware
          cutlery, eating utensil: fork
Examples of probabilities
Entity (40%)
  Inanimate-object (17%)
    Natural-object (1.6%)
      Geological-formation (0.17%)
        Natural-elevation (0.011%): Hill (0.0019%)
        Shore (0.008%): Coast (0.002%)
Hierarchical clustering by WordNet
Performance of WordNet

Word pair         | Human judgment | Path | WUP  | LIN  | Jiang&Conrath
Car – Automobile  | 3.92 | 1    | 1    | 1    | 1
Gem – Jewel       | 3.84 | 1    | 1    | 1    | 1
Journey – Voyage  | 3.84 | 0.97 | 0.92 | 0.84 | 0.88
Boy – Lad         | 3.76 | 0.97 | 0.93 | 0.86 | 0.88
Coast – Shore     | 3.70 | 0.97 | 0.91 | 0.98 | 0.99
Asylum – Madhouse | 3.61 | 0.97 | 0.94 | 0.97 |
Magician – Wizard | 3.50 | 1    | 1    | 1    | 1
Midday – Noon     | 3.42 | 1    | 1    | 1    | 1
Furnace – Stove   | 3.11 | 0.81 | 0.46 | 0.23 | 0.39
Food – Fruit      | 3.08 | 0.81 | 0.22 | 0.13 | 0.63
Bird – Cock       | 3.05 | 0.97 | 0.94 | 0.6  | 0.73
Bird – Crane      | 2.97 | 0.92 | 0.84 | 0.6  | 0.73

Path and WUP are edge-counting based measures; LIN and Jiang&Conrath are information-content based measures.
Part V: Time series
Clustering of time-series
Dynamic Time Warping (DTW)
Align two time series by minimizing the distance of the aligned observations. Solve by dynamic programming!
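A minimal DTW sketch by dynamic programming; squared error is assumed as the local cost (the slides do not fix one):

```python
def dtw(s, t):
    """Dynamic time warping distance between two sequences of numbers."""
    n, m = len(s), len(t)
    INF = float("inf")
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (s[i - 1] - t[j - 1]) ** 2
            # extend the cheapest of the three allowed alignment moves
            dp[i][j] = cost + min(dp[i - 1][j],      # s advances
                                  dp[i][j - 1],      # t advances
                                  dp[i - 1][j - 1])  # both advance
    return dp[n][m]

print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4]))  # 0.0: perfect warped alignment
```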
Example of DTW
Prototype of a cluster
The sequence c that minimizes E(S_j, c) is called the Steiner sequence. A good approximation to the Steiner problem is to use the medoid of the cluster (the discrete median), i.e. the time series in the cluster that minimizes E(S_j, c).
Calculating the prototype
The exact prototype can be solved by dynamic programming, but the complexity is exponential in the number of time series in a cluster.
Averaging heuristic
1. Calculate the medoid sequence.
2. Calculate warping paths from the medoid to all other time series in the cluster.
3. The new prototype is the average sequence over the warping paths (see the sketch below).
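A compact Python sketch of this heuristic under stated assumptions: squared-error local cost, ties in backtracking broken arbitrarily, and the medoid given as input; all names are illustrative.

```python
def warping_path(s, t):
    """One optimal DTW alignment path [(i, j), ...] via backtracking."""
    n, m = len(s), len(t)
    INF = float("inf")
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (s[i - 1] - t[j - 1]) ** 2
            dp[i][j] = cost + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    path, (i, j) = [], (n, m)
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))  # store 0-indexed aligned pair
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)],
                   key=lambda ij: dp[ij[0]][ij[1]])
    path.append((0, 0))
    return path[::-1]

def average_prototype(medoid, others):
    """Average each medoid position over all observations warped onto it."""
    sums = list(medoid)              # the medoid's own values count too
    counts = [1] * len(medoid)
    for seq in others:
        for i, j in warping_path(medoid, seq):
            sums[i] += seq[j]
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]

print(average_prototype([1, 2, 3], [[1, 2, 4], [1, 2, 2]]))
```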
Local search heuristics
Example of the three methods
E(S) = 159, E(S) = 138, E(S) = 118
Local search (LS) provides a better fit in terms of the Steiner cost function. It cannot modify the sequence length during the iterations: on datasets with varying lengths it may still provide a better fit, but non-sensible prototypes.
Experiments
Part VI: Other clustering problems
Clustering of GPS trajectories
Density clusters
Homes of users, shop, walking street, market place, swimming hall, science park
Image segmentation
Objects of different colors
Literature
1. S. Theodoridis and K. Koutroumbas, Pattern Recognition, 2nd edition, Academic Press, 2006.
2. P. Fränti and T. Kaukoranta, "Binary vector quantizer design using soft centroids", Signal Processing: Image Communication, 14 (9), 677–681, 1999.
3. I. Kärkkäinen and P. Fränti, "Variable metric for binary vector quantization", IEEE Int. Conf. on Image Processing (ICIP'04), Singapore, vol. 3, 3499–3502, October 2004.
Literature
– Modified k-modes + k-histograms: M. Ng, M.J. Li, J.Z. Huang and Z. He, "On the impact of dissimilarity measure in k-modes clustering algorithm", IEEE Trans. on Pattern Analysis and Machine Intelligence, 29 (3), 503–507, March 2007.
– ACE: K. Chen and L. Liu, "The 'best k' for entropy-based categorical data clustering", Int. Conf. on Scientific and Statistical Database Management (SSDBM'2005), pp. 253–262, Berkeley, USA, 2005.
– ROCK: S. Guha, R. Rastogi and K. Shim, "ROCK: A robust clustering algorithm for categorical attributes", Information Systems, Vol. 25, No. 5, pp. 345–366, 2000.
– K-medoids: L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York, 1990.
– K-modes: Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, Vol. 2, No. 3, pp. 283–304, 1998.
– K-distributions: Z. Cai, D. Wang and L. Jiang, "K-Distributions: A new algorithm for clustering categorical data", Int. Conf. on Intelligent Computing (ICIC 2007), pp. 436–443, Qingdao, China, 2007.
– K-histograms: Z. He, X. Xu, S. Deng and B. Dong, "K-Histograms: An efficient clustering algorithm for categorical dataset", CoRR, abs/cs/0509033, http://arxiv.org/abs/cs/0509033, 2005.