Clustering different types of data Pasi Fränti 21.3.2017
Data types Numeric Binary Categorical Text Time series
Part I: Numeric data
Distance measures

Type      Possible operations   Example variable   Example values
Nominal   ==                    Major subject      Computer science, Mathematics, Physics
Ordinal   ==, <, >              Degree             Bachelor, Master, Licentiate, Doctor
Interval  ==, <, >, -           Temperature        10 °C, 20 °C, 10 °F
Ratio     ==, <, >, -, /        Weight             0 kg, 10 kg, 20 kg
Definition of distance metric. A distance function d is a metric if the following conditions are met for all data points x, y, z: All distances are non-negative: d(x, y) ≥ 0. Distance from a point to itself is zero: d(x, x) = 0. All distances are symmetric: d(x, y) = d(y, x). Triangle inequality: d(x, y) ≤ d(x, z) + d(z, y).
Common distance metrics. For two points $X_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $X_j = (x_{j1}, x_{j2}, \ldots, x_{jp})$, the Minkowski distance is $d_{ij} = \big( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{q} \big)^{1/q}$. Euclidean distance: q = 2. Manhattan distance: q = 1.
Distance metrics example. 2-D example: $x_1 = (2, 8)$, $x_2 = (6, 3)$. Euclidean distance: $\sqrt{4^2 + 5^2} = \sqrt{41} \approx 6.4$. Manhattan distance: $4 + 5 = 9$.
Chebyshev distance. In the limit $q \to \infty$, the Minkowski distance equals the maximum difference over the attributes: $d_{ij} = \max_{k} |x_{ik} - x_{jk}|$. Useful if the worst case must be avoided. Example: for $x_1 = (2, 8)$ and $x_2 = (6, 3)$, the distance is $\max(4, 5) = 5$.
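As a minimal Python sketch (not part of the original slides), the Minkowski family above can be implemented directly from the definitions; the test points are the ones from the 2-D example.

```python
# Sketch of the Minkowski-family distances from the preceding slides.

def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def euclidean(x, y):
    return minkowski(x, y, 2)          # q = 2

def manhattan(x, y):
    return minkowski(x, y, 1)          # q = 1

def chebyshev(x, y):
    """Limit q -> infinity: the maximum attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x1, x2 = (2, 8), (6, 3)
print(euclidean(x1, x2))   # sqrt(4^2 + 5^2) = 6.40...
print(manhattan(x1, x2))   # 4 + 5 = 9
print(chebyshev(x1, x2))   # max(4, 5) = 5
```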
Hierarchical clustering cost functions. Three common cost functions: single linkage, complete linkage, average linkage.
Single link: the smallest distance between vectors in clusters i and j: $d(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$.
Complete link: the largest distance between vectors in clusters i and j: $d(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$.
Average link: the average distance between vectors in clusters i and j: $d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$.
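As an illustrative sketch (not from the slides), the three linkage costs can be written directly from their definitions; here a cluster is a list of points and d is any point-level distance such as the euclidean function sketched earlier. In practice a library routine such as scipy.cluster.hierarchy.linkage with method='single', 'complete' or 'average' serves the same purpose.

```python
# Sketch of the three linkage costs; clusters are lists of points and
# d is any point-level distance function (e.g. Euclidean).

def single_link(ci, cj, d):
    return min(d(x, y) for x in ci for y in cj)   # smallest pairwise distance

def complete_link(ci, cj, d):
    return max(d(x, y) for x in ci for y in cj)   # largest pairwise distance

def average_link(ci, cj, d):
    return sum(d(x, y) for x in ci for y in cj) / (len(ci) * len(cj))
```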
Cost function example [Theodoridis & Koutroumbas, 2006]: a seven-point data set (x1, ..., x7) and the dendrograms produced by single link and complete link (merge levels 1, 1.1, 1.2, 1.3, 1.4, 1.5).
Part II: Binary data
Hamming distance (binary and categorical data). The number of differing attribute values. Distance of (1011101) and (1001001) is 2. Distance of (2143896) and (2233796) is 3. Distance between (toned) and (roses) is 3. On the 3-bit binary cube: 100 -> 011 has distance 3 (red path), 010 -> 111 has distance 2 (blue path).
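A one-line Python sketch of the Hamming distance (illustrative, not from the slides), checked against the slide's examples:

```python
# Hamming distance over any equal-length sequences (bits, digits, or letters).

def hamming(a, b):
    assert len(a) == len(b), "Hamming distance requires equal-length strings"
    return sum(ai != bi for ai, bi in zip(a, b))

print(hamming("1011101", "1001001"))  # 2
print(hamming("2143896", "2233796"))  # 3
print(hamming("toned", "roses"))      # 3
```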
Hard thresholding of centroid. Rounding each attribute of the soft centroid (0.40, 0.60, 0.75, 0.20, 0.45, 0.25) at threshold 0.5 gives the binary centroid (0, 1, 1, 0, 0, 0).
Hard and soft centroids: comparison on the Bridge data set (binary version).
Distance and distortion. General distance function: $d(x_i, c_j) = \sum_{k=1}^{p} |x_{ik} - c_{jk}|^{\alpha}$ (the exponent is written α here to avoid clashing with the zero-count $q_{jk}$ below). Distortion function: $D = \sum_{j=1}^{K} \sum_{x_i \in C_j} d(x_i, c_j)$.
Distortion for binary data. Cost of a single attribute k in group j: $D_{jk} = q_{jk} \cdot c_{jk}^{\alpha} + r_{jk} \cdot (1 - c_{jk})^{\alpha}$, where $q_{jk}$ is the number of zeroes, $r_{jk}$ the number of ones, and $c_{jk}$ the current centroid value for variable k of group j.
Optimal centroid position. The optimal centroid position depends on the metric. Given the parameter $a_{jk} = (r_{jk}/q_{jk})^{1/(\alpha - 1)}$, the optimal position is $c_{jk} = \frac{a_{jk}}{1 + a_{jk}}$. For α = 2 this is the mean of the attribute values (soft centroid); for α = 1 the cost is linear in $c_{jk}$ and the optimum reduces to majority vote (hard centroid).
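A small sketch of the per-attribute optimum, built on the formulas reconstructed above (so treat it as an assumption-laden illustration rather than the paper's reference implementation):

```python
# Per-attribute optimal centroid under the distortion q*c^alpha + r*(1-c)^alpha,
# where q is the count of zeroes and r the count of ones in the cluster.

def optimal_centroid(q, r, alpha):
    if alpha == 1:                                  # L1 cost: majority vote
        return 1.0 if r > q else 0.0
    if q == 0:                                      # all ones -> centroid 1
        return 1.0
    a = (r / q) ** (1.0 / (alpha - 1))              # parameter a_jk
    return a / (1.0 + a)                            # optimal c_jk

# alpha = 2 recovers the mean of the attribute values (soft centroid):
print(optimal_centroid(q=3, r=1, alpha=2))  # 0.25 = mean of [0, 0, 0, 1]
```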
Example of centroid location
Centroid location
Part III: Categorical data

Categorical clustering. Three attributes:

                    Director    Actor     Genre
t1 (Godfather II)   Coppola     De Niro   Crime
t2 (Good Fellas)    Scorsese    De Niro   Crime
t3 (Vertigo)        Hitchcock   Stewart   Thriller
t4 (N by NW)        Hitchcock   Grant     Thriller
t5 (Bishop's Wife)  Koster      Grant     Comedy
t6 (Harvey)         Koster      Stewart   Comedy
Categorical clustering. Sample 2-D data (color and shape) and three possible clustering models: A, B, C.
K-means variants for categorical data: k-modes, k-medoids, k-distributions, k-histograms, k-populations, k-representatives.
Entropy-based cost functions. Category utility: $CU = \frac{1}{K} \sum_{j=1}^{K} P(C_j) \sum_{k} \sum_{v} \big[ P(A_k = v \mid C_j)^2 - P(A_k = v)^2 \big]$. Entropy of data set: $H(X) = -\sum_{k} \sum_{v} P(A_k = v) \log P(A_k = v)$. Entropies of the clusters relative to the data (expected entropy): $\bar{H} = \sum_{j=1}^{K} \frac{n_j}{n} H(C_j)$.
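An illustrative sketch of the expected-entropy cost (standard definitions; the code is not from the slides):

```python
# Expected entropy of a categorical clustering; clusters are lists of tuples.
from collections import Counter
from math import log2

def entropy(cluster):
    """Sum of per-attribute entropies of one cluster."""
    n, h = len(cluster), 0.0
    for k in range(len(cluster[0])):
        counts = Counter(row[k] for row in cluster)
        h -= sum(c / n * log2(c / n) for c in counts.values())
    return h

def expected_entropy(clusters):
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * entropy(c) for c in clusters)
```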
Iterative algorithms
K-modes clustering: distance function. The distance between a vector and a mode is the number of mismatching attributes. Example: vector (A, F, I) vs. mode (A, D, G): the second and third attributes differ, so the distance is 2.
K-modes clustering: prototype of cluster. The mode takes the most frequent value of each attribute independently. Vectors (A, D, G), (B, D, H), (A, F, I) have mode (A, D, ·); the third attribute is a three-way tie, so any of G, H, I can be chosen.
K-medoids clustering: prototype of cluster. The medoid is the vector with minimal total distance to the others. For vectors (A, C, E), (B, C, F), (B, D, G) the total Hamming distances are 2+3 = 5, 2+2 = 4, and 3+2 = 5, so the medoid is (B, C, F).
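Both categorical prototypes can be sketched in a few lines of Python (illustrative, not from the slides); the test data reproduces the two slide examples above:

```python
# Sketch of the two categorical prototypes: mode and medoid.
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def mode_vector(vectors):
    """Most frequent value of each attribute, chosen independently (ties arbitrary)."""
    return tuple(Counter(v[k] for v in vectors).most_common(1)[0][0]
                 for k in range(len(vectors[0])))

def medoid(vectors, d):
    """The vector with minimal total distance to all others."""
    return min(vectors, key=lambda v: sum(d(v, u) for u in vectors))

print(mode_vector([("A", "D", "G"), ("B", "D", "H"), ("A", "F", "I")]))
# -> ('A', 'D', 'G'); the third attribute is a tie, any of G/H/I is valid
print(medoid([("A", "C", "E"), ("B", "C", "F"), ("B", "D", "G")], hamming))
# -> ('B', 'C', 'F'), total distance 2 + 2 = 4
```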
K-medoids Example
K-medoids Calculation
K-histograms. The prototype stores, for each attribute, the frequency of every value in the cluster. For the three vectors above, the second attribute has the histogram D: 2/3, F: 1/3.
K-distributions. The cost function is based on cluster-conditional probabilities, with a small ε added to avoid zero probabilities.
Example of cluster allocation Change of entropy
Problem of non-convergence
Results with Census dataset
Literature
Modified k-modes + k-histograms: M. Ng, M.J. Li, J.Z. Huang and Z. He, "On the impact of dissimilarity measure in k-modes clustering algorithm", IEEE Trans. on Pattern Analysis and Machine Intelligence, 29 (3), 503-507, March 2007.
ACE: K. Chen and L. Liu, "The 'best k' for entropy-based categorical data clustering", Int. Conf. on Scientific and Statistical Database Management (SSDBM'2005), pp. 253-262, Berkeley, USA, 2005.
ROCK: S. Guha, R. Rastogi and K. Shim, "ROCK: a robust clustering algorithm for categorical attributes", Information Systems, Vol. 25, No. 5, pp. 345-366, 2000.
K-medoids: L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York, 1990.
K-modes: Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, Vol. 2, No. 3, pp. 283-304, 1998.
K-distributions: Z. Cai, D. Wang and L. Jiang, "K-distributions: a new algorithm for clustering categorical data", Int. Conf. on Intelligent Computing (ICIC 2007), pp. 436-443, Qingdao, China, 2007.
K-histograms: Z. He, X. Xu, S. Deng and B. Dong, "K-histograms: an efficient clustering algorithm for categorical dataset", CoRR, abs/cs/0509033, http://arxiv.org/abs/cs/0509033, 2005.
Part IV: Text data
Applications of text clustering Query relaxation Spell-checking Automatic categorization Document clustering
Query relaxation. Current solution: matching suffixes from the database. Alternative solution: semantic clustering.
Spell-checking. The word kahvila (Finnish for café): one correct and two incorrect spellings.
Automatic categorization Category by clustering
Document clustering. Motivation: group related documents based on their content; no predefined training set (taxonomy); generate a taxonomy at runtime. Clustering process: (1) data preprocessing: tokenize, remove stop words, stem, extract features, lexical analysis; (2) define the cost function; (3) perform the clustering.
Text clustering String similarity is the basis for clustering text data A measure is required to calculate the similarity between two strings
String similarity. Semantic: car and auto; automobile and auto; отель and готель ("hotel" in Russian and Ukrainian). Syntactic: sauna and sana.
Semantic similarity. Lexical database: WordNet (English). Words are grouped into sets of synonyms (synsets), e.g. {car, auto}, {bike, bicycle}, and related via generalization, e.g. car, auto -> automotive, motor -> wheeled vehicle -> conveyance, transport -> instrumentality -> artifact -> object.
Similarity using WordNet [Wu and Palmer, 1994]. Input: word 1: wolf, word 2: hunting dog. Output: similarity value = 0.89.
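A sketch using NLTK's WordNet interface; the library choice is an assumption (the slides name none), and the exact score depends on the WordNet version, so 0.89 may not be reproduced:

```python
# Wu-Palmer similarity via NLTK; run nltk.download("wordnet") once beforehand.
from nltk.corpus import wordnet as wn

wolf = wn.synset("wolf.n.01")
hound = wn.synset("hunting_dog.n.01")
print(wolf.wup_similarity(hound))  # Wu-Palmer similarity in [0, 1]
```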
Hierarchical clustering by WordNet
Syntactic similarity. Operates on the words and their characters. Can be divided into three components: character-level similarity measures, matching techniques, token similarity.
Syntactic similarity workflow
Character-level measures. Treat strings as sequences of characters and determine the similarity in one of three ways: exact match, transformation, longest common substring. Running examples: location names such as The Point, Tigne Point, Tigne Point mall, The Avenue, Golden house Chinese restaurant, The Palace.
Exact match. Binary result: 1 if the strings are identical, 0 otherwise. "Machine Learning" vs. "Machine Learning": 1 (match). "Machine Learning" vs. "Machine Learned": 0 (mismatch).
Transformation. Edit distance: the number of single edit operations (insertion, deletion, substitution) needed to transform one string into another. Hamming: allows only substitutions; the strings must be of equal length. Jaro/Winkler: based on the number of matching and transposed characters (a/u, u/a).
Levenshtein edit distance: example. Input: string 1: kitten, string 2: sitting. Output: 3. Substitute k with s: sitten. Substitute e with i: sittin. Insert g: sitting.
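A standard dynamic-programming implementation (illustrative, not from the slides):

```python
# Levenshtein distance with a rolling one-row DP table.

def levenshtein(s, t):
    m, n = len(s), len(t)
    prev = list(range(n + 1))          # distances for the empty prefix of s
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

print(levenshtein("kitten", "sitting"))  # 3
```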
Longest common substring. Finds the longest contiguous sequence of characters that co-occurs in two strings. Example 1: ABABC vs. ABCBA: edit distance = 2, LCS = 3 (ABC). Example 2: AAAAA vs. AXAXA: edit distance = 2, LCS = 1 (A). The edit distances are equal, but the LCS reveals that the first pair is more similar.
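The same dynamic-programming idea gives the longest common substring; this sketch (not from the slides) reproduces the two examples:

```python
# Longest common substring (contiguous) in O(|s|*|t|) time.

def lcs_length(s, t):
    best = 0
    # dp[i][j] = length of the common substring ending at s[i-1] and t[j-1]
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

print(lcs_length("ABABC", "ABCBA"))  # 3 ("ABC")
print(lcs_length("AAAAA", "AXAXA"))  # 1 ("A")
```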
String segmentation. Q-grams: divides a string into substrings of length q; e.g. the 2-grams of "bingo" are bi, in, ng, go. Tokenization: breaks a string into words and symbols called tokens using whitespace, line breaks, and punctuation characters; e.g. "The club at the Ivy" -> The, club, at, the, Ivy.
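Both segmentation schemes in a short sketch (the tokenizer's punctuation handling is an illustrative choice, not specified on the slide):

```python
# Q-gram extraction and simple whitespace/punctuation tokenization.
import re

def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def tokenize(s):
    # split on runs of whitespace and non-word characters, drop empty tokens
    return [t for t in re.split(r"[\s\W]+", s) if t]

print(qgrams("bingo", 2))               # ['bi', 'in', 'ng', 'go']
print(tokenize("The club at the Ivy"))  # ['The', 'club', 'at', 'the', 'Ivy']
```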
Matching techniques
Token similarity Two alternatives to compare tokens: Exact matching: 1 if match, 0 otherwise. Approximate matching: compute similarity between tokens using a character-level measure
Approximate matching: example [Monge and Elkan, 1996]. Input: string 1: gray color, string 2: the grey colour. Pairwise similarities using edit distance (Smith-Waterman-Gotoh):

         the    grey   colour   maximum
gray     0.20   0.90   0.30     0.90
color    …      …      0.80     0.80

Output: the overall similarity is the average of the row maxima: (0.90 + 0.80) / 2 = 0.85.
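A sketch of the Monge-Elkan scheme: average, over the tokens of string 1, of the best token-level similarity in string 2. As an assumption, a normalized Levenshtein similarity (a compact repeat of the earlier routine) stands in for the Smith-Waterman-Gotoh measure of the slide, so the exact value 0.85 is not reproduced:

```python
# Monge-Elkan similarity with a normalized Levenshtein inner measure.

def levenshtein(s, t):
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

def sim(a, b):
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def monge_elkan(s1, s2):
    t1, t2 = s1.split(), s2.split()
    return sum(max(sim(a, b) for b in t2) for a in t1) / len(t1)

print(monge_elkan("gray color", "the grey colour"))  # ~0.79 with this measure
```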
Similarities for sample data

Compared strings                                                  Edit dist.  Q-gram q=2  Q-gram q=3  Q-gram q=4  Cosine   Note
Pizza Express Café / Pizza Express                                72%         79%         74%         70%         82%
Lounasravintola Pinja Ky – ravintoloita / Lounasravintola Pinja   54%         68%         67%         65%         63%
Kioski Piirakkapaja / Kioski Marttakahvio                         47%         45%         33%         32%         50%
Kauppa Kulta Keidas / Kauppa Kulta Nalle                          67%         60%         …           …           …
Ravintola Beer Stop Pub / Baari, Beer Stop R-kylä                 39%         42%         36%         31%         …
Ravintola Foxie s Bar / Foxie Karsikko                            25%         15%         12%         24%         …        Different
Play baari / Ravintola Bar Play – Ravintoloita                    21%         17%         8%          …           …        Different
Part V: Time series
Clustering of time-series
Dynamic Time Warping Align two time-series by minimizing distance of the aligned observations Solve by dynamic programming!
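A textbook dynamic-programming DTW sketch (the squared point-wise cost is an illustrative choice, not mandated by the slide):

```python
# DTW between two 1-D sequences via the standard DP recurrence.

def dtw(a, b):
    INF = float("inf")
    n, m = len(a), len(b)
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            dp[i][j] = cost + min(dp[i - 1][j],      # a[i] repeats
                                  dp[i][j - 1],      # b[j] repeats
                                  dp[i - 1][j - 1])  # one-to-one match
    return dp[n][m]

print(dtw([1, 2, 3, 4], [1, 2, 2, 3, 4]))  # 0.0: every point aligns exactly
```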
Example of DTW
Prototype of a cluster. The sequence c that minimizes E(Sj, c) is called the Steiner sequence. A good approximation to the Steiner problem is to use the medoid of the cluster (discrete median), i.e. the time series in the cluster that minimizes E(Sj, c).
Calculating the prototype. Can be solved by dynamic programming, but the complexity is exponential in the number of time series in the cluster.
Averaging heuristic: (1) calculate the medoid sequence; (2) calculate the warping paths from the medoid to all other time series in the cluster; (3) the new prototype is the average sequence over the warping paths.
Local search heuristics
Example of the three methods: medoid E(S) = 159, averaging E(S) = 138, local search E(S) = 118. Local search provides the best fit in terms of the Steiner cost function. However, it cannot modify the sequence length during the iterations: on data sets with sequences of varying lengths it may fit the cost function better yet produce less sensible prototypes.
Experiments
Part VI: Other clustering problems
Clustering of GPS trajectories
Density clusters from GPS data: walking street, swim hall, market place, science park, homes of users, shop.
Image segmentation: objects of different colors.
Literature
S. Theodoridis and K. Koutroumbas, Pattern Recognition, 2nd edition, Academic Press, 2006.
P. Fränti and T. Kaukoranta, "Binary vector quantizer design using soft centroids", Signal Processing: Image Communication, 14 (9), 677-681, 1999.
I. Kärkkäinen and P. Fränti, "Variable metric for binary vector quantization", IEEE Int. Conf. on Image Processing (ICIP'04), Singapore, vol. 3, 3499-3502, October 2004.
M. Pucher, "Performance evaluation of WordNet-based semantic relatedness measures for word prediction in conversational speech", FTW, 2004.
A.E. Monge and C. Elkan, "The field matching problem: algorithms and applications", Int. Conf. on Knowledge Discovery and Data Mining, 267-270, 1996.