Clustering: Cost function. Clustering methods: Part 4. Pasi Fränti, Speech and Image Processing Unit, School of Computing, University of Eastern Finland. 29.4.2014.

Data types: Numeric, Binary, Categorical, Text, Time series.

Part I: Numeric data

Distance measures

Definition of distance metric. A distance function d is a metric if the following conditions are met for all data points x, y, z:
– All distances are non-negative: d(x, y) ≥ 0
– Distance from a point to itself is zero: d(x, x) = 0
– All distances are symmetric: d(x, y) = d(y, x)
– Triangle inequality: d(x, y) ≤ d(x, z) + d(z, y)

Common distance metrics. For vectors X_i = (x_i1, x_i2, …, x_ip) and X_j = (x_j1, x_j2, …, x_jp) over p dimensions, the Minkowski distance is
d_ij = ( Σ_{k=1..p} |x_ik − x_jk|^q )^{1/q}
Euclidean distance: q = 2. Manhattan distance: q = 1.
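A minimal sketch of these metrics in Python (assuming plain numeric tuples; the function names are illustrative):

```python
import math

def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def euclidean(x, y):
    return minkowski(x, y, 2)          # q = 2

def manhattan(x, y):
    return minkowski(x, y, 1)          # q = 1

def chebyshev(x, y):
    """Limit q -> infinity: maximum attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

# 2-D example from the slides: x1 = (2, 8), x2 = (6, 3)
x1, x2 = (2, 8), (6, 3)
print(euclidean(x1, x2))   # sqrt(4^2 + 5^2) = 6.40...
print(manhattan(x1, x2))   # 4 + 5 = 9
print(chebyshev(x1, x2))   # max(4, 5) = 5
```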

Distance metrics example in 2-D: x1 = (2, 8), x2 = (6, 3). Euclidean distance: √((6−2)² + (3−8)²) = √41 ≈ 6.4. Manhattan distance: |6−2| + |3−8| = 4 + 5 = 9.

Chebyshev distance. In the limit q → ∞, the Minkowski distance equals the maximum difference over the attributes: d(X_i, X_j) = max_k |x_ik − x_jk|. Useful if the worst case must be avoided. Example: for x1 = (2, 8) and x2 = (6, 3), the Chebyshev distance is max(4, 5) = 5.

Hierarchical clustering cost functions:
– Single link: the smallest distance between vectors in clusters i and j: d(C_i, C_j) = min_{x∈C_i, y∈C_j} d(x, y)
– Complete link: the largest distance between vectors in clusters i and j: d(C_i, C_j) = max_{x∈C_i, y∈C_j} d(x, y)
– Average link: the average distance between vectors in clusters i and j: d(C_i, C_j) = (1 / (|C_i|·|C_j|)) Σ_{x∈C_i} Σ_{y∈C_j} d(x, y)
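The three linkage costs as a small Python sketch, assuming an arbitrary pairwise distance function d is supplied (names and example data are illustrative):

```python
from itertools import product

def single_link(A, B, d):
    """Smallest pairwise distance between clusters A and B."""
    return min(d(a, b) for a, b in product(A, B))

def complete_link(A, B, d):
    """Largest pairwise distance between clusters A and B."""
    return max(d(a, b) for a, b in product(A, B))

def average_link(A, B, d):
    """Average pairwise distance between clusters A and B."""
    return sum(d(a, b) for a, b in product(A, B)) / (len(A) * len(B))

# Example with 1-D points and absolute difference as the distance
d = lambda a, b: abs(a - b)
A, B = [1.0, 2.0], [5.0, 9.0]
print(single_link(A, B, d), complete_link(A, B, d), average_link(A, B, d))
# 3.0 8.0 5.5
```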

Single Link

Complete Link

Average Link

Cost function example [Theodoridis & Koutroumbas, 2006]: a data set x1, …, x7 clustered with single link and complete link (dendrograms omitted).

Part II: Binary data

Hamming distance (binary and categorical data): the number of differing attribute values. Example: the distance between (toned) and (roses) is 3. On the 3-bit binary cube, 100 → 011 has distance 3 and 010 → 111 has distance 2.
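A one-line Hamming distance in Python, reproducing the slide's examples (assuming equal-length sequences):

```python
def hamming(x, y):
    """Number of attribute positions in which x and y differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

print(hamming("toned", "roses"))    # 3
print(hamming("100", "011"))        # 3  (path across the 3-bit cube)
print(hamming("010", "111"))        # 2
```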

Hard thresholding of centroid: e.g. the centroid (0.40, 0.60, 0.75, 0.20, 0.45, 0.25) thresholded at 0.5 becomes the binary vector (0, 1, 1, 0, 0, 0).

Hard and soft centroids: Bridge data set (binary version).

Distance and distortion. General distance function between a data vector and a centroid: d(x_i, c_j) = Σ_k |x_ik − c_jk|^q. Distortion function: the total distance of all data vectors to the centroid of their own cluster.

Distortion for binary data. Cost of a single attribute k in group j: the zeroes contribute q_jk · d(c_jk, 0) and the ones contribute r_jk · d(c_jk, 1), where q_jk is the number of zeroes, r_jk the number of ones, and c_jk the current centroid value of variable k in group j.

Optimal centroid position. The optimal centroid position depends on the metric: given the metric parameter, the optimal value of c_jk can be solved from the counts q_jk and r_jk.

Example of centroid location

Centroid location

Part III: Categorical data

Categorical clustering: sample data with three attributes.

Categorical clustering. Sample 2-D data with two attributes: color and shape. Models A, B and C.

Categorical clustering methods: k-modes, k-medoids, k-distributions. Histogram-based methods: k-histograms, k-populations, k-representatives.

Entropy-based cost functions:
– Category utility: measures how much the clustering improves the prediction of attribute values compared to the whole data set.
– Entropy of the data set: H(X) = −Σ_i P(x_i) log P(x_i).
– Entropies of the clusters relative to the data.
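A hedged Python sketch of one common entropy criterion for categorical clustering (the size-weighted expected entropy of the partition, not necessarily the exact category-utility formula used on the slides; data and names are illustrative):

```python
import math
from collections import Counter

def attribute_entropy(values):
    """Entropy of the value distribution of one categorical attribute."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def cluster_entropy(cluster):
    """Sum of attribute entropies within one cluster of categorical vectors."""
    return sum(attribute_entropy([x[k] for x in cluster])
               for k in range(len(cluster[0])))

def expected_entropy(clusters, n_total):
    """Cost of a clustering: cluster entropies weighted by cluster sizes."""
    return sum(len(c) / n_total * cluster_entropy(c) for c in clusters)

data = [("red", "circle"), ("red", "square"), ("blue", "circle"), ("blue", "square")]
clusters = [data[:2], data[2:]]        # grouped by color
print(expected_entropy(clusters, len(data)))   # 1.0: shape still varies inside the clusters
```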

Iterative algorithms

K-modes clustering. Distance function: the number of mismatching attribute values, d(x, y) = Σ_k δ(x_k, y_k), where δ(a, b) = 0 if a = b and 1 otherwise.

K-modes clustering. Prototype of a cluster: the mode, i.e. the most frequent value of each attribute within the cluster.
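A short Python sketch of the k-modes building blocks described above, i.e. the mismatch distance and the mode prototype (example data is illustrative; ties in the mode are broken arbitrarily):

```python
from collections import Counter

def kmodes_distance(x, y):
    """Number of mismatching attribute values (k-modes distance)."""
    return sum(a != b for a, b in zip(x, y))

def mode_prototype(cluster):
    """Mode vector: most frequent value of each attribute in the cluster."""
    return tuple(Counter(x[k] for x in cluster).most_common(1)[0][0]
                 for k in range(len(cluster[0])))

cluster = [("A", "C", "E"), ("B", "C", "F"), ("B", "C", "F")]
print(mode_prototype(cluster))                              # ('B', 'C', 'F')
print(kmodes_distance(("A", "C", "E"), ("B", "C", "F")))    # 2
```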

K-medoids clustering. Prototype of a cluster is the medoid: the vector with the minimal total distance to all other vectors in the cluster. Example: for the vectors ACE, BCF and BDG, the total distances are 2+3 = 5, 2+2 = 4 and 3+2 = 5, so BCF is chosen as the medoid.
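The medoid selection can be sketched in a few lines of Python; this reproduces the ACE/BCF/BDG example under the Hamming distance:

```python
def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def medoid(cluster, d=hamming):
    """Vector with minimal total distance to all other vectors in the cluster."""
    return min(cluster, key=lambda c: sum(d(c, x) for x in cluster))

cluster = ["ACE", "BCF", "BDG"]
print(medoid(cluster))   # 'BCF': total distance 2 + 2 = 4, smaller than 5 for the others
```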

K-medoids Example

K-medoids Calculation

K-histograms. Prototype of a cluster: a histogram of the attribute values, e.g. value D with frequency 2/3 and value F with frequency 1/3.

K-distributions: cost function with an ε addition (a small constant added to avoid zero probabilities).

Example of cluster allocation: change of entropy.

Problem of non-convergence.

Results with Census dataset

Part IV: Text data

Applications of text clustering: query relaxation, spell-checking, automatic categorization, document similarity.

Query relaxation. Current solution: matching suffixes from the database. Alternative solution: relaxations from semantic clustering.

Spell-checking. Example: the word kahvila (café) with one correct and two incorrect spellings.

Automatic categorization: category by clustering.

String clustering. The similarity between every string pair is calculated as the basis for determining the clusters. A similarity measure is required to calculate the similarity between two strings: approximate string matching or semantic similarity.

Document clustering. Motivation:
– Group related documents based on their content
– No predefined training set (taxonomy)
– Generate a taxonomy at runtime
Clustering process:
– Data preprocessing: remove stop words, stem, feature extraction and lexical analysis
– Define cost function
– Perform clustering

Exact string matching. Given a text string T of length n and a pattern string P of length m, the exact string matching problem is to find all occurrences of P in T. Example: T = "AGCTTGA", P = "GCT". Applications:
– Searching keywords in a file
– Search engines (like Google)
– Database searching

Approximate string matching. Determine whether a text string T of length n and a pattern string P of length m partially match.
– Consider the string "approximate". Which of these are partial matches? aproximate, approximately, appropriate, proximate, approx, approximat, apropos, approxximate
– A partial match can be thought of as one that has k differences from the string, where k is some small integer (for instance 1 or 2).
– A difference occurs if string1.charAt(j) != string2.charAt(j), or if string1.charAt(j) does not appear in string2 (or vice versa). The former case is known as a revise (substitution) difference, the latter as a delete or insert difference.
– What about two characters that appear out of position? For instance, approximate vs. apporximate?

Approximate string matching. Example: the query "Schwarrzenger" against names such as Keanu Reeves, Samuel Jackson, Arnold Schwarzenegger, H. Norman Schwarfkopf, Bernard Schwartz, …
Query errors: limited knowledge about the data, typos, limited input device (cell phone).
Data errors: typos, web data, OCR.
Similarity functions: edit distance, q-grams, cosine.

Edit distance (Levenshtein distance). Given two strings T and P, the edit distance is the minimum number of substitutions, insertions and deletions that transform T into P. Time complexity by dynamic programming: O(nm).

Edit distance by dynamic programming (1974): m[i][j] = min{ m[i−1][j] + 1, m[i][j−1] + 1, m[i−1][j−1] + d(i, j) }, where d(i, j) = 0 if the i-th character of T equals the j-th character of P, and 1 otherwise. Example table for T = "temp" and P = "tmp":
        ''  t  m  p
   ''    0  1  2  3
   t     1  0  1  2
   e     2  1  1  2
   m     3  2  1  2
   p     4  3  2  1
The edit distance is the bottom-right value, 1 (delete the 'e').
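The same recurrence as a runnable Python sketch (a straightforward O(nm) implementation, not the original course code):

```python
def edit_distance(t, p):
    """Levenshtein distance by dynamic programming, O(nm) time."""
    n, m = len(t), len(p)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                       # delete all of t[:i]
    for j in range(m + 1):
        dp[0][j] = j                       # insert all of p[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if t[i - 1] == p[j - 1] else 1   # d(i, j)
            dp[i][j] = min(dp[i - 1][j] + 1,           # deletion
                           dp[i][j - 1] + 1,           # insertion
                           dp[i - 1][j - 1] + cost)    # match / substitution
    return dp[n][m]

print(edit_distance("temp", "tmp"))   # 1 (delete the 'e')
```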

Q-grams: substrings of fixed length q (e.g. the 2-grams of "bingo"). Filtering property: if ed(T, P) ≤ k, then the number of q-grams common to T and P is at least (number of q-grams of T) − k·q.

Q-gram example. T = "bingo", P = "going", q = 2:
gram1 = {#b, bi, in, ng, go, o#}
gram2 = {#g, go, oi, in, ng, g#}
Total(gram1, gram2) = {#b, bi, in, ng, go, o#, #g, go, oi, in, ng, g#}
|common terms difference| = sum{1,1,0,0,0,1, 1,0,1,0,0,1} = 6
gram1.length = (T.length + (q − 1) * 2 + 1) − q = 6
gram2.length = (P.length + (q − 1) * 2 + 1) − q = 6
L = gram1.length + gram2.length = 12
Similarity = (L − |common terms difference|) / L = (12 − 6) / 12 = 0.5
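A Python sketch of this q-gram similarity, reproducing the bingo/going value of 0.5 (padding and the similarity formula follow the slide; the helper names are illustrative):

```python
def qgrams(s, q=2):
    """q-grams of s after padding both ends with '#' (q - 1 times)."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def qgram_similarity(t, p, q=2):
    """Fraction of q-grams shared by the two padded strings."""
    g1, g2 = qgrams(t, q), qgrams(p, q)
    diff = sum(g not in g2 for g in g1) + sum(g not in g1 for g in g2)
    total = len(g1) + len(g2)
    return (total - diff) / total

print(qgram_similarity("bingo", "going"))   # 0.5
```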

Cosine similarity. For two vectors A and B, the cosine of the angle θ between them is represented using the dot product and magnitudes: cos θ = (A · B) / (‖A‖ ‖B‖). Implementation for strings: cosine similarity = (common terms) / (sqrt(number of terms in String1) * sqrt(number of terms in String2)).

Cosine similarity example. T = "bingo right", P = "going right":
T1 = {bingo, right}, P1 = {going, right}
L1 = unique(T1).length = 2; L2 = unique(P1).length = 2
unique(T1 & P1) = {bingo, right, going}, L3 = 3
Common terms = (L1 + L2) − L3 = 1
Similarity = common terms / (sqrt(L1) * sqrt(L2)) = 1 / 2 = 0.5

Dice coefficient. Similar to cosine similarity: Dice coefficient = (2 × common terms) / (number of terms in String1 + number of terms in String2).
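Both token-based measures in a short Python sketch, using the "bingo right" / "going right" example from the cosine slide (whitespace tokenization is assumed):

```python
import math

def token_sets(t, p):
    return set(t.split()), set(p.split())

def cosine_similarity(t, p):
    """Common terms / (sqrt(|T|) * sqrt(|P|)), as on the slides."""
    a, b = token_sets(t, p)
    return len(a & b) / (math.sqrt(len(a)) * math.sqrt(len(b)))

def dice_coefficient(t, p):
    """2 * common terms / (|T| + |P|)."""
    a, b = token_sets(t, p)
    return 2 * len(a & b) / (len(a) + len(b))

print(cosine_similarity("bingo right", "going right"))   # 1 / (sqrt(2)*sqrt(2)) = 0.5
print(dice_coefficient("bingo right", "going right"))    # 2*1 / (2+2) = 0.5
```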

Similarities for sample data (edit distance / q-gram q=2 / q-gram q=3 / q-gram q=4 / cosine):
– Pizza Express Café vs Pizza Express: 72% / 79% / 74% / 70% / 82%
– Lounasravintola Pinja Ky – ravintoloita vs Lounasravintola Pinja: 54% / 68% / 67% / 65% / 63%
– Kioski Piirakkapaja vs Kioski Marttakahvio: 47% / 45% / 33% / 32% / 50%
– Kauppa Kulta Keidas vs Kauppa Kulta Nalle: 68% / 67% / 63% / 60% / 67%
– Ravintola Beer Stop Pub vs Baari, Beer Stop R-kylä: 39% / 42% / 36% / 31% / 50%
– Ravintola Foxie s Bar vs Foxie Karsikko: 31% / 25% / 15% / 12% / 24%
– Play baari vs Ravintola Bar Play – Ravintoloita: 21% / 31% / 17% / 8% / 32%

Thesaurus-based WordNet

WordNet:
– An extensive lexical network for the English language
– Contains over 138,838 words
– Several graphs, one for each part of speech
– Synsets (synonym sets), each defining a semantic sense
– Relationship information (antonym, hyponym, meronym, …)
– Downloadable for free (UNIX, Windows)
– Expanding to other languages (Global WordNet Association)
– Funded with over $3 million, mainly by government (translation interest)
– Founder: George Miller, National Medal of Science
Example relations: wet and dry are antonyms; watery, moist and damp are synonyms of wet; parched, anhydrous and arid are synonyms of dry.

Example of a WordNet hierarchy with nodes: object; artifact; instrumentality; conveyance, transport; vehicle; wheeled vehicle; car (auto, automotive, motor); bike, bicycle; truck; article; ware; table ware; cutlery, eating utensil; fork (tree structure omitted).

Examples of probabilities in the WordNet hierarchy: Entity (40%), Inanimate-object (17%), Natural-object (1.6%), Geological-formation (0.17%), Natural-elevation (0.011%), Shore (0.008%), Hill (0.0019%), Coast (0.002%).

Hierarchical clustering by WordNet

Performance of WordNet similarity measures: word pairs (car–automobile, gem–jewel, journey–voyage, boy–lad, coast–shore, asylum–madhouse, magician–wizard, midday–noon, furnace–stove, food–fruit, bird–cock, bird–crane) scored against human judgment using edge-counting based measures (Path, WUP) and information-content based measures (LIN, Jiang & Conrath); numeric scores omitted.
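For readers who want to reproduce such scores, a hedged sketch using NLTK's WordNet interface (assumes the nltk package with the 'wordnet' and 'wordnet_ic' corpora downloaded; the chosen word pairs and the Brown information-content file are illustrative):

```python
# Requires: pip install nltk, then nltk.download('wordnet') and nltk.download('wordnet_ic')
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')   # information content from the Brown corpus

pairs = [("coast", "shore"), ("bird", "crane"), ("food", "fruit")]
for w1, w2 in pairs:
    s1 = wn.synsets(w1, pos=wn.NOUN)[0]    # first noun sense of each word
    s2 = wn.synsets(w2, pos=wn.NOUN)[0]
    print(w1, w2,
          round(s1.path_similarity(s2), 3),            # edge counting: Path
          round(s1.wup_similarity(s2), 3),              # edge counting: Wu & Palmer
          round(s1.lin_similarity(s2, brown_ic), 3),    # information content: Lin
          round(s1.jcn_similarity(s2, brown_ic), 3))    # information content: Jiang & Conrath
```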

Part V: Time series

Clustering of time-series

Dynamic Time Warping (DTW): align two time series by minimizing the total distance between the aligned observations. Solved by dynamic programming!
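A minimal DTW sketch in Python (unconstrained warping, absolute difference as the local distance; not the original course implementation):

```python
def dtw_distance(s, t, d=lambda a, b: abs(a - b)):
    """Dynamic time warping distance between sequences s and t."""
    n, m = len(s), len(t)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = d(s[i - 1], t[j - 1])
            # every observation is aligned to at least one observation of the other series
            D[i][j] = cost + min(D[i - 1][j],      # repeat t[j-1]
                                 D[i][j - 1],      # repeat s[i-1]
                                 D[i - 1][j - 1])  # advance both
    return D[n][m]

print(dtw_distance([1, 2, 3, 4], [1, 2, 2, 3, 4]))   # 0.0: the extra 2 aligns for free
```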

Example of DTW

Prototype of a cluster. The sequence c that minimizes E(S_j, c) is called a Steiner sequence. A good approximation to the Steiner problem is to use the medoid of the cluster (discrete median): the time series in the cluster that minimizes E(S_j, c).

Calculating the prototype. It can be solved by dynamic programming, but the complexity is exponential in the number of time series in a cluster.

Averaging heuristic (see the sketch below):
– Calculate the medoid sequence.
– Calculate warping paths from the medoid to all other time series in the cluster.
– The new prototype is the average sequence over the warping paths.
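A hedged sketch of this averaging heuristic, assuming numeric sequences; for brevity the first sequence stands in for a properly selected medoid:

```python
def dtw_path(s, t, d=lambda a, b: abs(a - b)):
    """DTW alignment path (list of index pairs) between sequences s and t."""
    n, m = len(s), len(t)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = d(s[i-1], t[j-1]) + min(D[i-1][j], D[i][j-1], D[i-1][j-1])
    # backtrack from (n, m) to (1, 1), always moving to the cheapest predecessor
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i-1][j-1], i-1, j-1), (D[i-1][j], i-1, j), (D[i][j-1], i, j-1))
    return path[::-1]

def average_prototype(medoid, cluster):
    """Average, per medoid position, all observations warped onto that position."""
    buckets = [[] for _ in medoid]
    for series in cluster:
        for i, j in dtw_path(medoid, series):
            buckets[i].append(series[j])
    return [sum(b) / len(b) for b in buckets]

cluster = [[1, 2, 3, 4], [1, 2, 2, 3, 4], [0, 2, 3, 5]]
print(average_prototype(cluster[0], cluster))   # averaged prototype over the warping paths
```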

Local search heuristics

Example of the three methods: E(S) = 159, E(S) = 138 and E(S) = 118. Local search provides a better fit in terms of the Steiner cost function, but it cannot modify the sequence length during the iterations. In datasets with varying lengths it might provide a better fit, but non-sensitive prototypes.

Experiments

Part VI: Other clustering problems

Clustering of GPS trajectories

Density clusters: homes of users, shop, walking street, market place, swim hall, science park.

Image segmentation: objects of different colors.

Literature
1. S. Theodoridis and K. Koutroumbas, Pattern Recognition, 2nd edition, Academic Press.
2. P. Fränti and T. Kaukoranta, "Binary vector quantizer design using soft centroids", Signal Processing: Image Communication, 14 (9), 677–681.
3. I. Kärkkäinen and P. Fränti, "Variable metric for binary vector quantization", IEEE Int. Conf. on Image Processing (ICIP'04), Singapore, vol. 3, October 2004.

Literature (continued)
– Modified k-modes + k-histograms: M. Ng, M.J. Li, J.Z. Huang and Z. He, "On the impact of dissimilarity measure in k-modes clustering algorithm", IEEE Trans. on Pattern Analysis and Machine Intelligence, 29 (3), March 2007.
– ACE: K. Chen and L. Liu, "The 'best k' for entropy-based categorical data clustering", Int. Conf. on Scientific and Statistical Database Management (SSDBM'2005), Berkeley, USA, 2005.
– ROCK: S. Guha, R. Rastogi and K. Shim, "ROCK: a robust clustering algorithm for categorical attributes", Information Systems, Vol. 25, No. 5.
– K-medoids: L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York.
– K-modes: Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, Vol. 2, No. 3.
– K-distributions: Z. Cai, D. Wang and L. Jiang, "K-distributions: a new algorithm for clustering categorical data", Int. Conf. on Intelligent Computing (ICIC 2007), Qingdao, China, 2007.
– K-histograms: Zengyou He, Xiaofei Xu, Shengchun Deng and Bin Dong, "K-histograms: an efficient clustering algorithm for categorical dataset", CoRR.