Kansas State University, Department of Computing and Information Sciences
CIS 732: Machine Learning and Pattern Recognition

Lecture 21 of 42: Partitioning-Based Clustering and Expectation Maximization (EM)
Monday, 10 March 2008
William H. Hsu, Department of Computing and Information Sciences, KSU
Readings: Sections 7.1 - 7.3, Han & Kamber, 2nd edition

EM Algorithm: Example [3]

Expectation Step
– Suppose we observed m actual experiments, each n coin flips long
  • Each experiment corresponds to one choice of coin (drawn according to the mixture parameter)
  • Let h denote the number of heads in experiment x_i (a single data point)
– Q: How did we simulate the "fictional" data points, E[z log P(x | Θ)], where z denotes the hidden coin choice?
– A: By estimating, for 1 ≤ i ≤ m (i.e., the real data points), the expected value of each hidden coin-choice indicator under the current parameters
Maximization Step
– Q: What are we updating?  What objective function are we maximizing?
– A: We are updating the parameters Θ to maximize the expected complete-data log-likelihood computed in the E-step
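To make the E-step and M-step above concrete, here is a minimal sketch of EM for the two-coin mixture described on this slide. The variable names (theta_A, theta_B, resp_A) and the example head counts are illustrative assumptions, not values from the original slides; the sketch also assumes equal prior probability of choosing either coin.

import numpy as np

def em_two_coins(heads, n_flips, n_iter=50):
    """EM for a mixture of two biased coins.

    heads[i] : number of heads observed in experiment i
    n_flips  : flips per experiment
    Returns estimated head probabilities (theta_A, theta_B).
    """
    heads = np.asarray(heads, dtype=float)
    theta_A, theta_B = 0.6, 0.5          # arbitrary initial guesses
    for _ in range(n_iter):
        # E-step: responsibility of coin A for each experiment
        # (likelihood of the observed head count under each coin; equal priors assumed)
        like_A = theta_A**heads * (1 - theta_A)**(n_flips - heads)
        like_B = theta_B**heads * (1 - theta_B)**(n_flips - heads)
        resp_A = like_A / (like_A + like_B)      # estimated E[z_i] for coin A
        resp_B = 1.0 - resp_A
        # M-step: re-estimate parameters from the "fictional" weighted counts
        theta_A = np.sum(resp_A * heads) / np.sum(resp_A * n_flips)
        theta_B = np.sum(resp_B * heads) / np.sum(resp_B * n_flips)
    return theta_A, theta_B

# Example usage with made-up head counts for m = 5 experiments of n = 10 flips
print(em_two_coins(heads=[5, 9, 8, 4, 7], n_flips=10))

Estimating the mixing weight as well would add one more M-step line (the mean of resp_A).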

EM for Unsupervised Learning

Unsupervised Learning Problem
– Objective: estimate a probability distribution with unobserved variables
– Use EM to estimate the mixture policy (more on this later; see Section 6.12, Mitchell)
Pattern Recognition Examples
– Human-computer intelligent interaction (HCII)
  • Detecting facial features in emotion recognition
  • Gesture recognition in virtual environments
– Computational medicine [Frey, 1998]
  • Determining morphology (shapes) of bacteria and viruses in microscopy
  • Identifying cell structures (e.g., nucleus) and shapes in microscopy
– Other image processing
– Many other examples (audio, speech, signal processing; motor control; etc.)
Inference Examples
– Plan recognition: mapping from (observed) actions to an agent's (hidden) plans
– Hidden changes in context: e.g., aviation, computer security, MUDs

Unsupervised Learning: AutoClass [1]

Bayesian Unsupervised Learning
– Given: D = {(x_1, x_2, …, x_n)} (vectors of unlabeled attribute values)
– Return: the set of class labels that has maximum a posteriori (MAP) probability
Intuitive Idea
– Bayesian learning: the MAP hypothesis maximizes P(h | D) ∝ P(D | h) P(h)
– MDL/BIC (Occam's Razor): priors P(h) express the "cost of coding" each model h
– AutoClass
  • Define mutually exclusive, exhaustive clusters (class labels) y_1, y_2, …, y_J
  • Suppose: each y_j (1 ≤ j ≤ J) contributes to x
  • Suppose also: each y_j's contribution follows a known pdf, e.g., a mixture of Gaussians (MoG)
  • Conjugate priors: priors on y of the same form as priors on x
When to Use for Clustering
– Believe (or can assume): clusters are generated by a known pdf
– Believe (or can assume): clusters are combined using a finite mixture (later)

Unsupervised Learning: AutoClass [2]

AutoClass Algorithm [Cheeseman et al., 1988]
– Based on maximizing P(x | Θ_j, y_j, J)
  • Θ_j: class (cluster) parameters (e.g., mean and variance)
  • y_j: synthetic classes (can estimate the marginal P(y_j) at any time)
– Apply Bayes's Theorem; use numerical BOC estimation techniques (cf. Gibbs sampling)
– Search objectives
  • Find the best J (ideally: integrate out Θ_j, y_j; in practice: start with a large J and decrease it)
  • Find Θ_j, y_j: use MAP estimation, then "integrate in the neighborhood" of y_MAP
EM: Find the MAP Estimate for P(x | Θ_j, y_j, J) by Iterative Refinement
Advantages over Symbolic (Non-Numerical) Methods
– Returns a probability distribution over class membership
  • More robust than a single "best" y_j
  • Compare: fuzzy set membership (similar, but probabilistically motivated)
– Can deal with continuous as well as discrete data
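AutoClass itself is a full Bayesian system with priors over models; as a rough stand-in for the ideas on this slide (finite Gaussian mixture, soft class membership, searching over the number of classes J with an Occam's-Razor penalty), the sketch below fits Gaussian mixtures for several candidate J and scores them with BIC using scikit-learn's GaussianMixture. The synthetic two-blob data is made up; this is an illustration, not the AutoClass algorithm.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic unlabeled data: two Gaussian blobs (stand-in for D)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])

best_model, best_bic = None, np.inf
for J in range(1, 6):                       # candidate numbers of clusters
    gmm = GaussianMixture(n_components=J, random_state=0).fit(X)
    bic = gmm.bic(X)                        # BIC plays the Occam's-Razor role
    if bic < best_bic:
        best_model, best_bic = gmm, bic

print("chosen J:", best_model.n_components)
print("soft memberships of first point:", best_model.predict_proba(X[:1]))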

Unsupervised Learning: AutoClass [3]

AutoClass Resources
– Beginning tutorial (AutoClass II): Cheeseman et al., in Buchanan and Wilkins
– Project page:
Applications
– Knowledge discovery in databases (KDD) and data mining
  • Infrared Astronomical Satellite (IRAS): spectral atlas (sky survey)
  • Molecular biology: pre-clustering DNA acceptor and donor sites (mouse, human)
  • LandSat data from Kansas (30 km² region, 1024 x 1024 pixels, 7 channels)
  • Positive findings: see the book chapter by Cheeseman and Stutz, online
– Other typical applications: see KD Nuggets
Implementations
– Source code obtainable from the project page
  • AutoClass III: Lisp implementation [Cheeseman, Stutz, Taylor, 1992]
  • AutoClass C: C implementation [Cheeseman, Stutz, Taylor, 1998]
– These and others at:

Unsupervised Learning: Competitive Learning for Feature Discovery

Intuitive Idea: Competitive Mechanisms for Unsupervised Learning
– Global organization from local, competitive weight updates
  • Basic principle expressed by von der Malsburg
  • Guiding examples from (neuro)biology: lateral inhibition
– Previous work: Hebb, 1949; Rosenblatt, 1959; von der Malsburg, 1973; Fukushima, 1975; Grossberg, 1976; Kohonen, 1982
A Procedural Framework for Unsupervised Connectionist Learning
– Start with identical ("neural") processing units, with random initial parameters
– Set a limit on the "activation strength" of each unit
– Allow units to compete for the right to respond to a set of inputs
Feature Discovery
– Identifying (or constructing) new features relevant to supervised learning
– Examples: finding distinguishable letter characteristics in handwritten character recognition (HCR), optical character recognition (OCR)
– Competitive learning: transform X into X'; train the units in X' closest to x

Unsupervised Learning: Kohonen's Self-Organizing Map (SOM) [1]

Another Clustering Algorithm
– aka Self-Organizing Feature Map (SOFM)
– Given: vectors of attribute values (x_1, x_2, …, x_n)
– Returns: vectors of attribute values (x_1', x_2', …, x_k')
  • Typically, n >> k (n is high; k = 1, 2, or 3; hence "dimensionality reducing")
  • Output: vectors x', the projections of input points x; also get P(x_j' | x_i)
  • Mapping from x to x' is topology preserving
Topology-Preserving Networks
– Intuitive idea: similar input vectors will map to similar clusters
– Recall: informal definition of cluster (isolated set of mutually similar entities)
– Restatement: "clusters of X (high-dimensional) will still be clusters of X' (low-dimensional)"
Representation of Node Clusters
– Group of neighboring artificial neural network units (neighborhood of nodes)
– SOMs: combine ideas of topology-preserving networks and unsupervised learning
– Implementation: MATLAB NN Toolkit, among others

Unsupervised Learning: Kohonen's Self-Organizing Map (SOM) [2]

Kohonen Network (SOM) for Clustering
– Training algorithm: unnormalized competitive learning
– Map is organized as a grid (shown here in 2D)
  • Each node (grid element) has a weight vector w_j
  • Dimension of w_j is n (same as the input vector)
  • Number of trainable parameters (weights): m · m · n for an m-by-m SOM
  • 1999 state of the art: typical small SOMs 5-20, "industrial strength" > 20
– Output found by selecting the node j* whose w_j has minimum Euclidean distance from x
  • Only one active node, aka Winner-Take-All (WTA): winning node j*
  • i.e., j* = argmin_j || w_j - x ||^2
Update Rule
– Same as the competitive learning algorithm, with one modification
– The neighborhood function associated with j* spreads the update of w_j around
[Figure: x is a vector in n-space; x' is a vector in 2-space]
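A minimal numpy sketch of the winner-take-all selection and neighborhood update described above. The grid size, learning-rate schedule, and Gaussian neighborhood width are illustrative choices, not values from the slides.

import numpy as np

def train_som(X, m=10, n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Train an m-by-m Kohonen SOM on data X (shape: [num_points, n])."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.random((m, m, n))                      # weight vector w_j per grid node
    grid = np.stack(np.meshgrid(np.arange(m), np.arange(m), indexing="ij"), axis=-1)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        # Winner-take-all: node j* with minimum Euclidean distance to x
        dists = np.linalg.norm(W - x, axis=-1)
        j_star = np.unravel_index(np.argmin(dists), dists.shape)
        # Annealed learning rate and neighborhood width (both shrink over time)
        lr = lr0 * np.exp(-t / n_iter)
        sigma = sigma0 * np.exp(-t / n_iter)
        # Gaussian neighborhood: update strength decays with grid distance from j*
        grid_dist2 = np.sum((grid - np.array(j_star)) ** 2, axis=-1)
        h = np.exp(-grid_dist2 / (2 * sigma ** 2))
        W += lr * h[..., None] * (x - W)
    return W

# Usage: project 5-dimensional points onto a 10-by-10 map
W = train_som(np.random.rand(500, 5))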

Unsupervised Learning: Kohonen's Self-Organizing Map (SOM) [3]

Traditional Competitive Learning
– Only train j*
– Corresponds to a neighborhood of 0
Neighborhood Function h_{j, j*}
– For 2D Kohonen SOMs, h is typically a square or hexagonal region
  • j*, the winner, is at the center of Neighborhood(j*)
  • h_{j*, j*} ≈ 1
– Nodes in Neighborhood(j) are updated whenever j wins, i.e., j* = j
– Strength of the information fed back to w_j is inversely proportional to its distance from j* for each x
– Often use an exponential or Gaussian (normal) distribution on the neighborhood to decay the weight delta as distance from j* increases
Annealing of Training Parameters
– Neighborhood must shrink to 0 to achieve convergence
– r (learning rate) must also decrease monotonically
[Figure: neighborhoods of size 0 and 1 around the winning node j*]

Unsupervised Learning: SOM and Other Projections for Clustering

[Figure: cluster formation and segmentation algorithm (sketch): dimensionality-reducing projection (x'), clusters of similar records, Delaunay triangulation, Voronoi (nearest-neighbor) diagram (y)]

Unsupervised Learning: Other Algorithms (PCA, Factor Analysis)

Intuitive Idea
– Q: Why are dimensionality-reducing transforms good for supervised learning?
– A: There may be many attributes with undesirable properties, e.g.,
  • Irrelevance: x_i has little discriminatory power over c(x) = y_i
  • Sparseness of information: the "feature of interest" is spread out over many x_i's (e.g., text document categorization, where x_i is a word position)
– We want to increase the "information density" by "squeezing X down"
Principal Components Analysis (PCA)
– Combining redundant variables into a single variable (aka component, or factor)
– Example: ratings (e.g., Nielsen) and polls (e.g., Gallup); responses to certain questions may be correlated (e.g., "like fishing?" and "time spent boating")
Factor Analysis (FA)
– General term for a class of algorithms that includes PCA
– Tutorial:
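As a small illustration of "combining redundant variables into a single component", here is a PCA sketch using the eigendecomposition of the covariance matrix; the two correlated survey-style variables are made-up data standing in for the polling example.

import numpy as np

rng = np.random.default_rng(1)
# Two correlated, redundant variables (e.g., "like fishing?" score vs. hours spent boating)
fishing = rng.normal(5, 2, size=200)
boating = 0.8 * fishing + rng.normal(0, 0.5, size=200)
X = np.column_stack([fishing, boating])

# PCA: center, eigendecompose the covariance matrix, project onto the top component
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
top = eigvecs[:, -1]                            # principal component direction
scores = Xc @ top                               # 1-D "information-dense" projection

print("variance explained by first component:", eigvals[-1] / eigvals.sum())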

Clustering Methods: Design Choices

Intuition
– Functional (declarative) definition: easy ("We recognize a cluster when we see it")
– Operational (procedural, constructive) definition: much harder to give
– Possible reason: clustering of objects into groups has taxonomic semantics (e.g., shape, size, time, resolution, etc.)
Possible Assumptions
– Data generated by a particular probabilistic model
– No statistical assumptions
Design Choices
– Distance (similarity) measure: standard metrics, transformation-invariant metrics
  • L_1 (Manhattan): Σ |x_i - y_i|;  L_2 (Euclidean): sqrt(Σ (x_i - y_i)^2);  L_∞ (sup): max |x_i - y_i|
  • Symmetry: Mahalanobis distance
  • Shift and scale invariance: covariance matrix
– Transformations (e.g., covariance diagonalization: rotate axes to get rotational invariance; cf. PCA, FA)
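To make the transformation-invariant option concrete, a short sketch of the Mahalanobis distance, which rescales differences by the data covariance; the sample data used to estimate the covariance is illustrative.

import numpy as np

def mahalanobis(x, y, cov_inv):
    """Mahalanobis distance between points x and y, given an inverse covariance matrix."""
    d = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(d @ cov_inv @ d))

# Illustrative data: covariance estimated from a small random sample
X = np.random.default_rng(2).normal(size=(100, 3))
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
print(mahalanobis(X[0], X[1], cov_inv))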

Clustering: Applications

[Figure collage of example applications]
– Information retrieval, text document categorization: ThemeScapes map of news stories from the WWW in 1997
– Transactional database mining: NCSA D2K
– Facial feature extraction: NCSA D2K (FaceFeatureFinding.html; data from T. Mitchell's web site)

Unsupervised Learning and Constructive Induction

Unsupervised Learning in Support of Supervised Learning
– Given: D, labeled vectors (x, y)
– Return: D', transformed training examples (x', y')
– Solution approach: constructive induction
  • Feature "construction": generic term
  • Cluster definition
Feature Construction: Front End
– Synthesizing new attributes
  • Logical: x_1 ∧ x_2; arithmetic: x_1 + x_5 / x_2
  • Other synthetic attributes: f(x_1, x_2, …, x_n), etc.
– Dimensionality-reducing projection, feature extraction
– Subset selection: finding relevant attributes for a given target y
– Partitioning: finding relevant attributes for given targets y_1, y_2, …, y_p
Cluster Definition: Back End
– Form, segment, and label clusters to get intermediate targets y'
– Change of representation: find an (x', y') that is good for learning the target y
[Figure: constructive induction pipeline, (x, y) → feature (attribute) construction and partitioning → x' / (x_1', …, x_p') → cluster definition → (x', y') or ((x_1', y_1'), …, (x_p', y_p'))]

Clustering: Relation to Constructive Induction

Clustering versus Cluster Definition
– Clustering: a 3-step process
– Cluster definition: the "back end" for feature construction
Clustering: 3-Step Process
– Form (x_1', …, x_k') in terms of (x_1, …, x_n)
  • NB: typically part of the construction step; sometimes integrates both
– Segment (y_1', …, y_J') in terms of (x_1', …, x_k')
  • NB: the number of clusters J is not necessarily the same as the number of dimensions k
– Label
  • Assign names (discrete/symbolic labels (v_1', …, v_J')) to (y_1', …, y_J')
  • Important in document categorization (e.g., clustering text for information retrieval)
Hierarchical Clustering: Applying Clustering Recursively

CLUTO

Clustering Algorithms
– High-performance and high-quality partitional clustering
– High-quality agglomerative clustering
– High-quality graph-partitioning-based clustering
– Hybrid partitional and agglomerative algorithms for building trees for very large datasets
Cluster Analysis Tools
– Cluster signature identification
– Cluster organization identification
Visualization Tools
– Hierarchical trees
– High-dimensional datasets
– Cluster relations
Interfaces
– Stand-alone programs
– Library with a fully published API
Available on Windows, Sun, and Linux

Today

Clustering
– Distance Measures
– Graph-Based Techniques
– K-Means Clustering
– Tools and Software for Clustering

Prediction, Clustering, Classification

What is Prediction?
– The goal of prediction is to forecast or deduce the value of an attribute based on the values of other attributes
– A model is first created based on the data distribution
– The model is then used to predict future or unknown values
Supervised vs. Unsupervised Classification
– Supervised Classification = Classification
  • We know the class labels and the number of classes
– Unsupervised Classification = Clustering
  • We do not know the class labels and may not know the number of classes

What is Clustering in Data Mining?

Cluster
– a collection of data objects that are "similar" to one another and thus can be treated collectively as one group
– but, as a collection, they are sufficiently different from other groups
Clustering
– unsupervised classification
– no predefined classes
Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters
Helps users understand the natural grouping or structure in a data set

Requirements of Clustering Methods

– Scalability
– Dealing with different types of attributes
– Discovery of clusters with arbitrary shape
– Minimal requirements for domain knowledge to determine input parameters
– Ability to deal with noise and outliers
– Insensitivity to the order of input records
– Handling the curse of dimensionality
– Interpretability and usability

Applications of Clustering

Clustering has wide applications in:
– Pattern Recognition
– Spatial Data Analysis
  • create thematic maps in GIS by clustering feature spaces
  • detect spatial clusters and explain them in spatial data mining
– Image Processing
– Market Research
– Information Retrieval
  • Document or term categorization
  • Information visualization and IR interfaces
– Web Mining
  • Cluster Web usage data to discover groups of similar access patterns
  • Web Personalization

Clustering Methodologies

Two general methodologies
– Partitioning-Based Algorithms
– Hierarchical Algorithms
Partitioning-Based
– divide a set of N items into K clusters (top-down)
Hierarchical
– agglomerative: pairs of items or clusters are successively linked to produce larger clusters
– divisive: start with the whole set as one cluster and successively divide it into smaller partitions

Distance or Similarity Measures

Measuring Distance
– In order to group similar items, we need a way to measure the distance between objects (e.g., records)
– Note: distance is the inverse of similarity
– Often based on the representation of objects as "feature vectors"
[Tables: an employee DB and term frequencies for documents; which objects are more similar?]

Distance or Similarity Measures

Properties of Distance Measures
– for all objects A and B: dist(A, B) ≥ 0, and dist(A, B) = dist(B, A)
– for any object A: dist(A, A) = 0
– dist(A, C) ≤ dist(A, B) + dist(B, C)
Common Distance Measures
– Manhattan distance: dist(X, Y) = Σ_i |x_i - y_i|
– Euclidean distance: dist(X, Y) = sqrt(Σ_i (x_i - y_i)^2)
– Cosine similarity: sim(X, Y) = (Σ_i x_i y_i) / (sqrt(Σ_i x_i^2) · sqrt(Σ_i y_i^2))
– Each can be normalized to make values fall between 0 and 1
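A small sketch of the three measures above, implemented directly rather than via a library so the correspondence to the formulas is explicit; the term-frequency vectors are made up for illustration.

import math

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

# Example term-frequency vectors (hypothetical)
d1, d2 = [3, 0, 2, 1], [1, 1, 2, 0]
print(manhattan(d1, d2), euclidean(d1, d2), round(cosine_sim(d1, d2), 3))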

Distance or Similarity Measures

Weighting Attributes
– in some cases we want some attributes to count more than others
– associate a weight with each of the attributes when calculating distance, e.g., a weighted Euclidean distance
Nominal (Categorical) Attributes
– can use simple matching: distance = 0 if the values match, 1 otherwise
– or convert each nominal attribute to a set of binary attributes, then use the usual distance measure
– if all attributes are nominal, we can normalize by dividing the number of matches by the total number of attributes
Normalization
– want values to fall between 0 and 1, e.g., by dividing each attribute difference by that attribute's range
– other variations are possible
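The sketch below combines these ideas: per-attribute weights, simple matching for nominal attributes, and range normalization for numeric ones. The record layout, attribute ranges, and weights are hypothetical, not taken from the slide's employee table.

def mixed_distance(a, b, numeric_ranges, weights):
    """Weighted distance over a mix of numeric and nominal attributes.

    a, b           : dicts mapping attribute name -> value
    numeric_ranges : attribute name -> (min, max), for range normalization
    weights        : attribute name -> weight
    """
    total = 0.0
    for attr, w in weights.items():
        if attr in numeric_ranges:                       # numeric: range-normalized difference
            lo, hi = numeric_ranges[attr]
            d = abs(a[attr] - b[attr]) / (hi - lo)
        else:                                            # nominal: simple matching
            d = 0.0 if a[attr] == b[attr] else 1.0
        total += w * d ** 2
    return total ** 0.5

# Hypothetical employee records
e1 = {"age": 30, "salary": 45000, "gender": "F"}
e2 = {"age": 41, "salary": 52000, "gender": "M"}
print(mixed_distance(e1, e2,
                     numeric_ranges={"age": (25, 50), "salary": (30000, 80000)},
                     weights={"age": 1.0, "salary": 1.0, "gender": 1.0}))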

Distance or Similarity Measures

Example
– max distance for age = 25 (used to normalize the age differences)
– dist(ID2, ID3) = sqrt(0 + (0.04)^2 + (0.44)^2) ≈ 0.44
– dist(ID2, ID4) = sqrt(1 + (0.72)^2 + (0.12)^2) ≈ 1.24

Domain-Specific Distance Functions

For some data sets, we may need to use specialized functions
– we may want a single attribute, or a selected group of attributes, to be used in the computation of distance (the same problem as "feature selection")
– we may want to use special properties of one or more attributes in the data
– natural distance functions may exist in the data

Example: Zip Codes
– dist_zip(A, B) = 0, if the zip codes are identical
– dist_zip(A, B) = 0.1, if the first 3 digits are identical
– dist_zip(A, B) = 0.5, if the first digits are identical
– dist_zip(A, B) = 1, if the first digits are different

Example: Customer Solicitation
– dist_solicit(A, B) = 0, if both A and B responded
– dist_solicit(A, B) = 0.1, if both A and B were chosen but did not respond
– dist_solicit(A, B) = 0.5, if both A and B were chosen, but only one responded
– dist_solicit(A, B) = 1, if one was chosen but the other was not
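A sketch of the zip-code example as code. The tier values mirror the slide; the slide does not state how many leading digits the 0.5 tier compares, so comparing only the first digit is an assumption here.

def dist_zip(a: str, b: str) -> float:
    """Tiered distance between two zip codes, following the slide's scheme.

    Assumption: the 0.5 tier compares only the first digit.
    """
    if a == b:
        return 0.0
    if a[:3] == b[:3]:
        return 0.1
    if a[:1] == b[:1]:
        return 0.5
    return 1.0

print(dist_zip("66506", "66502"))   # same first 3 digits -> 0.1
print(dist_zip("66506", "60601"))   # same first digit    -> 0.5
print(dist_zip("66506", "10001"))   # different           -> 1.0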

Distance (Similarity) Matrix

Similarity (Distance) Matrix
– based on the distance or similarity measure, we can construct a symmetric matrix of distance (or similarity) values
– the (i, j) entry in the matrix is the distance (similarity) between items i and j
– note that d_ij = d_ji (i.e., the matrix is symmetric), so we only need the lower-triangular part of the matrix
– the diagonal is all 1's (similarity) or all 0's (distance)
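A short sketch of building such a matrix with scipy's pairwise-distance helpers; the term-frequency vectors are placeholders, not the table from the slides.

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Placeholder term-frequency vectors for 5 terms across 4 documents
terms = np.array([[3, 0, 2, 1],
                  [1, 1, 2, 0],
                  [0, 4, 0, 1],
                  [2, 0, 3, 1],
                  [0, 3, 1, 0]])

D = squareform(pdist(terms, metric="euclidean"))  # symmetric, zero diagonal
print(np.round(D, 2))
# Only the lower triangle is needed in practice:
lower = np.tril(D)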

Example: Term Similarities in Documents

[Tables: document-term frequencies and the resulting term-term similarity matrix]

Similarity (Distance) Thresholds

– A similarity (distance) threshold may be used to mark pairs that are "sufficiently" similar
– [Table: the result of using a threshold value of 10 in the previous example]

Graph Representation

The similarity matrix can be visualized as an undirected graph
– each item is represented by a node, and edges represent the fact that two items are similar (a one in the thresholded similarity matrix)
– if no threshold is used, the matrix can be represented as a weighted graph
[Figure: threshold graph over terms T1-T8]
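A sketch of turning a similarity matrix into a thresholded, undirected graph with networkx; the similarity values and the threshold of 10 below are stand-ins for the slide's example.

import numpy as np
import networkx as nx

# Made-up symmetric term-term similarity matrix for terms T1..T4
S = np.array([[ 0, 12,  3, 11],
              [12,  0,  9, 14],
              [ 3,  9,  0,  2],
              [11, 14,  2,  0]])
threshold = 10

G = nx.Graph()
G.add_nodes_from(f"T{i+1}" for i in range(len(S)))
for i in range(len(S)):
    for j in range(i + 1, len(S)):
        if S[i, j] >= threshold:               # keep only "sufficiently similar" pairs
            G.add_edge(f"T{i+1}", f"T{j+1}", weight=int(S[i, j]))

print(sorted(G.edges(data=True)))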

Simple Clustering Algorithms

If we are interested only in the threshold (and not in the degree of similarity or distance), we can use the graph directly for clustering

Clique Method (complete link)
– all items within a cluster must be within the similarity threshold of all other items in that cluster
– clusters may overlap
– generally produces small but very tight clusters
Single Link Method
– any item in a cluster must be within the similarity threshold of at least one other item in that cluster
– produces larger but weaker clusters
Other Methods
– star method: start with an item and place all related items in that cluster
– string method: start with an item; place one related item in that cluster; then place another item related to the last item entered, and so on

Simple Clustering Algorithms

Clique Method
– a clique is a completely connected subgraph of a graph
– in the clique method, each maximal clique in the graph becomes a cluster
[Figure: threshold graph over terms T1-T8]
The maximal cliques (and therefore the clusters) in the previous example are:
  {T1, T3, T4, T6}, {T2, T4, T6}, {T2, T6, T8}, {T1, T5}, {T7}
Note that, for example, {T1, T3, T4} is also a clique, but it is not maximal.
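Continuing the graph sketch above, networkx can enumerate the maximal cliques directly. The edge list below is reconstructed so that it reproduces the clusters listed on the slide; treat it as an assumed reconstruction rather than the original data.

import networkx as nx

# Edge list reconstructed to match the slide's example graph over T1..T8
edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T6"), ("T3", "T4"), ("T3", "T6"),
         ("T4", "T6"), ("T2", "T4"), ("T2", "T6"), ("T2", "T8"), ("T6", "T8"),
         ("T1", "T5")]
G = nx.Graph(edges)
G.add_node("T7")                      # isolated node: becomes its own cluster

clusters = [sorted(c) for c in nx.find_cliques(G)]   # maximal cliques
print(clusters)
# e.g. [['T1','T3','T4','T6'], ['T1','T5'], ['T2','T4','T6'], ['T2','T6','T8'], ['T7']]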

Simple Clustering Algorithms

Single Link Method
1. select an item not yet in a cluster and place it in a new cluster
2. place all other items similar to it in that cluster
3. repeat step 2 for each item in the cluster until nothing more can be added
4. repeat steps 1-3 for each item that remains unclustered
[Figure: threshold graph over terms T1-T8]
In this case the single link method produces only two clusters:
  {T1, T3, T4, T5, T6, T2, T8}, {T7}
Note that the single link method does not allow overlapping clusters, and thus partitions the set of items.
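On a thresholded graph, this procedure amounts to finding connected components; here is a sketch reusing the reconstructed graph from the clique example above (again an assumed reconstruction of the slide's data).

import networkx as nx

edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T6"), ("T3", "T4"), ("T3", "T6"),
         ("T4", "T6"), ("T2", "T4"), ("T2", "T6"), ("T2", "T8"), ("T6", "T8"),
         ("T1", "T5")]
G = nx.Graph(edges)
G.add_node("T7")

# Single link over a threshold graph = connected components (a partition, no overlaps)
clusters = [sorted(c) for c in nx.connected_components(G)]
print(clusters)   # [['T1','T2','T3','T4','T5','T6','T8'], ['T7']]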

Clustering with Existing Clusters

The notion of comparing item similarities can be extended to clusters themselves, by focusing on a representative vector for each cluster
– cluster representatives can be actual items in the cluster or other "virtual" representatives such as the centroid
– this methodology reduces the number of similarity computations in clustering
– clusters are revised successively until a stopping condition is satisfied, or until no more changes to clusters can be made
Partitioning Methods
– reallocation method: start with an initial assignment of items to clusters and then move items from cluster to cluster to obtain an improved partitioning
– single pass method: simple and efficient, but produces large clusters and depends on the order in which items are processed
Hierarchical Agglomerative Methods
– start with individual items and combine them into clusters
– then successively combine smaller clusters to form larger ones
– the grouping of individual items can be based on any of the methods discussed earlier
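As an illustration of the agglomerative route, scipy's linkage/fcluster can build and cut a cluster tree; the feature vectors below are placeholders, and the "complete" and "single" linkage options correspond roughly to the clique-like and single-link notions discussed above.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder feature vectors for 6 items
X = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.2],
              [8.0, 8.1], [8.2, 7.9], [4.0, 5.0]])

# Agglomerative clustering: successively merge the closest clusters
Z = linkage(X, method="complete")                 # or method="single"
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels)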

K-Means Algorithm

The basic algorithm (based on the reallocation method):
1. select K data points as the initial cluster representatives
2. for i = 1 to N, assign item x_i to the most similar centroid (this gives K clusters)
3. for j = 1 to K, recalculate the cluster centroid C_j
4. repeat steps 2 and 3 until there is (little or) no change in the clusters

Example: Clustering Terms
– Initial (arbitrary) assignment: C1 = {T1, T2}, C2 = {T3, T4}, C3 = {T5, T6}
– [Table: cluster centroids]
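A compact sketch of these four steps in numpy, on random data; in practice a library call such as scikit-learn's KMeans would normally be used, and the handling of empty clusters here is one of several reasonable choices.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1
    for _ in range(n_iter):
        # step 2: assign each item to the nearest centroid
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None, :], axis=2), axis=1)
        # step 3: recompute each centroid (keep the old one if a cluster empties)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):               # step 4: stop when stable
            break
        centroids = new_centroids
    return labels, centroids

X = np.random.default_rng(1).normal(size=(200, 2))
labels, centroids = kmeans(X, k=3)
print(centroids)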

Example: K-Means

Example (continued)
– Now, using the simple similarity measure, compute the new cluster-term similarity matrix
– Now compute the new cluster centroids using the original document-term matrix
– The process is repeated until no further changes are made to the clusters

K-Means Algorithm

Strengths of k-means
– Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations; normally, k, t << n
– Often terminates at a local optimum
Weaknesses of k-means
– Applicable only when a mean is defined; what about categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
Variations of k-means usually differ in
– Selection of the initial k means
– Dissimilarity calculations
– Strategies to calculate cluster means

Terminology

Expectation-Maximization (EM) Algorithm
– Iterative refinement: repeat until convergence to a locally optimal label
– Expectation step: estimate parameters with which to simulate the data
– Maximization step: use the simulated ("fictitious") data to update the parameters
Unsupervised Learning and Clustering
– Constructive induction: using unsupervised learning for supervised learning
  • Feature construction: "front end" - construct new x values
  • Cluster definition: "back end" - use these to reformulate y
– Clustering problems: formation, segmentation, labeling
– Key criterion: distance metric (points closer intra-cluster than inter-cluster)
– Algorithms
  • AutoClass: Bayesian clustering
  • Principal Components Analysis (PCA), factor analysis (FA)
  • Self-Organizing Maps (SOM): topology-preserving transform (dimensionality reduction) for competitive unsupervised learning

Summary Points

Expectation-Maximization (EM) Algorithm
Unsupervised Learning and Clustering
– Types of unsupervised learning
  • Clustering, vector quantization
  • Feature extraction (typically, dimensionality reduction)
– Constructive induction: unsupervised learning in support of supervised learning
  • Feature construction (aka feature extraction)
  • Cluster definition
– Algorithms
  • EM: mixture parameter estimation (e.g., for AutoClass)
  • AutoClass: Bayesian clustering
  • Principal Components Analysis (PCA), factor analysis (FA)
  • Self-Organizing Maps (SOM): projection of data; competitive algorithm
– Clustering problems: formation, segmentation, labeling
Next Lecture: Time Series Learning and Characterization