DATA MINING: from data to information. Ronald Westra, Dept. of Mathematics, Maastricht University

CLUSTERING AND CLUSTER ANALYSIS. Data Mining Lecture IV [Chapter 8: section 8.4, and Chapter 9, from Principles of Data Mining by Hand, Mannila, Smyth]

1. Clustering versus Classification
Classification: assign a pre-determined label to a sample.
Clustering: derive the relevant labels for classification from the structure in a given dataset.
Clustering aims at maximal intra-cluster similarity and maximal inter-cluster dissimilarity.
Objectives:
1. segmentation of the space
2. find natural subclasses

Examples of Clustering and Classification 1. Computer Vision

Examples of Clustering and Classification: 1. Computer Vision

Example of Clustering and Classification: 1. Computer Vision

Examples of Clustering and Classification: 2. Types of chemical reactions

Examples of Clustering and Classification: 2. Types of chemical reactions

Voronoi Clustering Georgy Fedoseevich Voronoy

Voronoi Clustering A Voronoi diagram (also called a Voronoi tessellation, Voronoi decomposition, or Dirichlet tessellation) is a special kind of decomposition of a metric space determined by the distances to a specified discrete set of objects in the space, e.g., a discrete set of points.
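As a small illustration of this idea, the following MATLAB/Octave sketch assigns each data point to the cell of its nearest site; the site coordinates and data points are made-up values, not from the lecture.

% Voronoi-style assignment: each point is assigned to the cell of its nearest site
sites = [0 0; 4 0; 2 3];               % illustrative site coordinates (one site per row)
X     = [1 1; 3 -1; 2 4; 0 2];         % illustrative data points (one point per row)
cell_idx = zeros(size(X, 1), 1);
for i = 1:size(X, 1)
    d = sum(bsxfun(@minus, sites, X(i, :)).^2, 2);   % squared distances to all sites
    [~, cell_idx(i)] = min(d);                       % index of nearest site = Voronoi cell
end
disp(cell_idx')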

Voronoi Clustering

Voronoi Clustering

Voronoi Clustering

Partitional Clustering [book section 9.4]
- score functions
- centroid
- intra-cluster distance
- inter-cluster distance
- C-means [book page 303]

k-means clustering (also: C-means) The k-means algorithm assigns each point to the cluster whose center (also called the centroid) is nearest. The center is the average of all the points in the cluster, i.e., each of its coordinates is the arithmetic mean of that coordinate over all the points in the cluster.

k-means clustering (also: C-means) Example: The data set has three dimensions and the cluster has two points: X = (x1, x2, x3) and Y = (y1, y2, y3). Then the centroid Z becomes Z = (z1, z2, z3), where z1 = (x1 + y1)/2 and z2 = (x2 + y2)/2 and z3 = (x3 + y3)/2
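A quick MATLAB/Octave check of this computation, with made-up values for the two points:

% quick check of the centroid formula (values are illustrative)
X = [1 4 7];                            % point X = (x1, x2, x3)
Y = [3 0 5];                            % point Y = (y1, y2, y3)
Z = mean([X; Y], 1)                     % per-dimension arithmetic mean -> [2 2 6]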

k-means clustering (also: C-means) This is the basic structure of the algorithm (J. MacQueen, 1967):
1. Randomly generate k clusters and determine the cluster centers, or directly generate k seed points as cluster centers.
2. Assign each point to the nearest cluster center.
3. Recompute the new cluster centers.
4. Repeat steps 2-3 until some convergence criterion is met (usually: the assignment no longer changes).

C-means [book page 303]

while changes in clusters C_k
    % form clusters
    for k = 1,…,K do
        C_k = { x : ||x – r_k|| ≤ ||x – r_l|| for all l ≠ k }
    end
    % compute new cluster centroids
    for k = 1,…,K do
        r_k = mean({ x : x ∈ C_k })
    end
end
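A minimal runnable MATLAB/Octave version of this loop, as a sketch rather than the book's code; the number of clusters K, the data layout (one sample per row of X), and the random initialization are assumptions:

% save as kmeans_sketch.m
function [idx, R] = kmeans_sketch(X, K)
    % X: N x d data matrix (one sample per row), K: number of clusters
    N = size(X, 1);
    R = X(randperm(N, K), :);              % initial centroids: K random samples
    idx = zeros(N, 1);
    changed = true;
    while changed                          % while changes in the cluster assignment
        % form clusters: assign each point to the nearest centroid r_k
        D = zeros(N, K);
        for k = 1:K
            D(:, k) = sum(bsxfun(@minus, X, R(k, :)).^2, 2);
        end
        [~, newidx] = min(D, [], 2);
        changed = any(newidx ~= idx);
        idx = newidx;
        % compute new cluster centroids
        for k = 1:K
            if any(idx == k)
                R(k, :) = mean(X(idx == k, :), 1);
            end
        end
    end
end

Usage, e.g.: [idx, R] = kmeans_sketch(randn(200, 2), 3).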

k-means clustering (also: C-means) The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. Yet it does not systematically yield the same result with each run: the resulting clusters depend on the initial assignments. The k-means algorithm minimizes intra-cluster (equivalently, maximizes inter-cluster) variance, but does not ensure that the solution it returns is more than a local minimum of the variance.

k-means clustering

k-means clustering (also: C-means)

Fuzzy c-means One of the problems of the k-means algorithm is that it gives a hard partitioning of the data, that is to say, each point is attributed to one and only one cluster. But points on the edge of a cluster, or near another cluster, may not belong to it as strongly as points in the center of the cluster.

Fuzzy c-means Therefore, in fuzzy clustering, each point does not belong to one given cluster, but has a degree of belonging to each cluster, as in fuzzy logic. For each point x we have a coefficient u_k(x) giving the degree of being in the k-th cluster. Usually, the sum of those coefficients has to be one, so that u_k(x) can be interpreted as the probability of belonging to cluster k:

Σ_{k=1..K} u_k(x) = 1

Fuzzy c-means With fuzzy c-means, the centroid of a cluster is computed as the mean of all points, weighted by their degree of belonging to the cluster, that is:

center_k = Σ_x u_k(x)^m x / Σ_x u_k(x)^m

Fuzzy c-means The degree of being in a certain cluster is related to the inverse of the distance to that cluster: u_k(x) = 1 / d(center_k, x). The coefficients are then normalized and fuzzified with a real parameter m > 1 so that their sum is 1. So:

u_k(x) = 1 / Σ_j ( d(center_k, x) / d(center_j, x) )^(2/(m-1))

Fuzzy c-means For m equal to 2, this is equivalent to normalising the coefficients linearly to make their sum 1. When m is close to 1, the cluster center closest to the point is given much more weight than the others, and the algorithm becomes similar to k-means.

Fuzzy c-means The fuzzy c-means algorithm is very similar to the k-means algorithm:

Fuzzy c-means
1. Choose a number of clusters.
2. Assign randomly to each point coefficients for being in the clusters.
3. Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold):
   - Compute the centroid for each cluster, using the formula above.
   - For each point, compute its coefficients of being in the clusters, using the formula above.

Fuzzy C-means

u_jk is the membership of sample j in cluster k; c_k is the centroid of cluster k

while changes in the memberships u_jk
    % compute new memberships
    for k = 1,…,K do
        for j = 1,…,N do
            u_jk = f(x_j – c_k)      % e.g. the normalized inverse-distance formula above
        end
    end
    % compute new cluster centroids (weighted means)
    for k = 1,…,K do
        c_k = Σ_j u_jk x_j / Σ_j u_jk
    end
end
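A minimal runnable MATLAB/Octave sketch of this loop, as an illustration rather than the book's code; it assumes Euclidean distances, a fuzzifier m > 1, and random initial memberships:

% save as fcm_sketch.m
function [U, C] = fcm_sketch(X, K, m, tol)
    % X: N x d data, K: number of clusters, m > 1: fuzzifier, tol: convergence threshold
    [N, d] = size(X);
    U = rand(N, K);
    U = bsxfun(@rdivide, U, sum(U, 2));     % random memberships, each row sums to 1
    C = zeros(K, d);
    while true
        Um = U.^m;
        % new centroids: means weighted by the (fuzzified) memberships
        C = bsxfun(@rdivide, Um' * X, sum(Um, 1)');
        % new memberships from normalized inverse distances
        D = zeros(N, K);
        for k = 1:K
            D(:, k) = sqrt(sum(bsxfun(@minus, X, C(k, :)).^2, 2)) + 1e-12;
        end
        Unew = 1 ./ (D.^(2/(m-1)) .* repmat(sum(D.^(-2/(m-1)), 2), 1, K));
        if max(abs(Unew(:) - U(:))) < tol
            U = Unew;
            break
        end
        U = Unew;
    end
end

Usage, e.g.: [U, C] = fcm_sketch(randn(200, 2), 3, 2, 1e-5).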

Fuzzy c-means The fuzzy c-means algorithm minimizes intra-cluster variance as well, but it has the same problems as k-means: the minimum found is a local minimum, and the results depend on the initial choice of weights.

Fuzzy c-means

Fuzzy c-means

Hierarchical Clustering [book section 9.5]
One major problem with partitional clustering is that the number of clusters (= number of classes) must be pre-specified!!!
This poses the question: what IS the real number of clusters in a given set of data? Answer: it depends!
- Agglomerative methods: bottom-up
- Divisive methods: top-down

Hierarchical Clustering Agglomerative hierarchical clustering
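A minimal MATLAB/Octave sketch of the bottom-up idea, using single linkage (repeatedly merging the clusters that contain the closest pair of points) until K clusters remain; this is a generic illustration, not code from the book, and the function name and stopping rule are assumptions:

% save as agglomerative_sketch.m
function labels = agglomerative_sketch(X, K)
    % X: N x d data, K: desired number of clusters; single-linkage merging
    N = size(X, 1);
    labels = (1:N)';                        % start: every point is its own cluster
    D = zeros(N);                           % pairwise squared distances
    for i = 1:N
        D(i, :) = sum(bsxfun(@minus, X, X(i, :)).^2, 2)';
    end
    D(logical(eye(N))) = inf;               % ignore self-distances
    while numel(unique(labels)) > K
        [~, p] = min(D(:));                 % closest remaining pair of points
        [i, j] = ind2sub([N N], p);
        if labels(i) ~= labels(j)
            labels(labels == labels(j)) = labels(i);   % merge their clusters
        end
        D(i, j) = inf;
        D(j, i) = inf;                      % never consider this pair again
    end
    [~, ~, labels] = unique(labels);        % relabel clusters as 1..K
end

Usage, e.g.: labels = agglomerative_sketch(randn(100, 2), 3).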

Hierarchical Clustering

Hierarchical Clustering

Hierarchical Clustering

Example of Clustering and Classification

1. Clustering versus Classification
Classification: assign a pre-determined label to a sample.
Clustering: derive the relevant labels for classification from the structure in a given dataset.
Clustering aims at maximal intra-cluster similarity and maximal inter-cluster dissimilarity.
Objectives:
1. segmentation of the space
2. find natural subclasses

DATA ANALYSIS AND UNCERTAINTY. Data Mining Lecture V [Chapter 4, Hand, Mannila, Smyth]

RANDOM VARIABLES [4.3]
- multivariate random variables
- marginal density
- conditional density & dependency: p(x|y) = p(x,y) / p(y)
- example: supermarket purchases

Example: supermarket purchases
X is an n customers x p products matrix; X(i,j) is a Boolean variable: "Has customer #i bought a product of type j?"
nA = sum(X(:,A)) is the number of customers that bought product A
nB = sum(X(:,B)) is the number of customers that bought product B
nAB = sum(X(:,A).*X(:,B)) is the number of customers that bought both product A and product B
*** Demo: matlab: conditionaldensity
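A short MATLAB/Octave illustration of the quantities above, estimating p(B | A) from such a purchase matrix; the random matrix and the column indices A and B are made up for the example (the actual demo script conditionaldensity is not reproduced here):

% illustrative purchase matrix: 1000 customers x 5 products, Boolean entries
X = rand(1000, 5) < 0.3;
A = 1;  B = 2;                          % illustrative product columns
nA  = sum(X(:, A));                     % customers that bought A
nAB = sum(X(:, A) .* X(:, B));          % customers that bought both A and B
pB_given_A = nAB / nA                   % empirical estimate of p(B | A) = p(A,B) / p(A)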

(Conditionally) independent: p(x,y) = p(x)*p(y), i.e. p(x|y) = p(x)
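Continuing the made-up purchase matrix from the previous snippet, independence can be checked empirically by comparing p(A,B) with p(A)·p(B):

% continuing the illustrative purchase matrix above (X, A, B, nA, nAB already defined)
n   = size(X, 1);
nB  = sum(X(:, B));
pA  = nA / n;  pB = nB / n;  pAB = nAB / n;
fprintf('p(A,B) = %.3f   p(A)*p(B) = %.3f\n', pAB, pA * pB)   % nearly equal if independent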

SAMPLING

ESTIMATION

Maximum Likelihood Estimation

BAYESIAN ESTIMATION

PROBABILISTIC MODEL-BASED CLUSTERING USING MIXTURE MODELS. Data Mining Lecture VI [4.5, 8.4, 9.2, 9.6, Hand, Mannila, Smyth]

Probabilistic Model-Based Clustering using Mixture Models A probability mixture model A mixture model is a formalism for modeling a probability density function as a sum of parameterized functions. In mathematical terms:

p_X(x) = Σ_{k=1..K} a_k h(x | λ_k)

A probability mixture model where p_X(x) is the modeled probability distribution function, K is the number of components in the mixture model, and a_k is the mixture proportion of component k. By definition, 0 < a_k < 1 for all k = 1…K and:

a_1 + a_2 + … + a_K = 1

A probability mixture model h(x | λ_k) is a probability distribution parameterized by λ_k. Mixture models are often used when we know h(x) and we can sample from p_X(x), but we would like to determine the a_k and λ_k values. Such situations can arise in studies in which we sample from a population that is composed of several distinct subpopulations.
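A small MATLAB/Octave illustration of such a situation, sampling from a two-component Gaussian mixture; all numeric values are made up, and h is taken to be a univariate Gaussian:

% sample n points from p(x) = a1*N(mu1, s1^2) + a2*N(mu2, s2^2)
n  = 1000;
a  = [0.3 0.7];                         % mixture proportions (sum to 1), illustrative
mu = [0; 5];                            % component means, illustrative
s  = [1; 2];                            % component standard deviations, illustrative
z  = 1 + (rand(n, 1) > a(1));           % hidden component label for each sample (1 or 2)
x  = mu(z) + s(z) .* randn(n, 1);       % draw each sample from its own component
hist(x, 40)                             % the two subpopulations appear as two modes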

A common approach for ‘decomposing’ a mixture model It is common to think of mixture modeling as a missing data problem. One way to understand this is to assume that the data points under consideration have "membership" in one of the distributions we are using to model the data. When we start, this membership is unknown, or missing. The job of estimation is to devise appropriate parameters for the model functions we choose, with the connection to the data points being represented as their membership in the individual model distributions.

Probabilistic Model-Based Clustering using Mixture Models The EM-algorithm [book section 8.4]

Mixture Decomposition: The ‘Expectation-Maximization’ Algorithm The Expectation-Maximization algorithm computes the missing memberships of data points in our chosen distribution model. It is an iterative procedure in which we start with initial parameters for our model distribution (the a_k's and λ_k's of the model listed above). The estimation then proceeds iteratively in two steps: the Expectation Step and the Maximization Step.

The ‘Expectation-Maximization’ Algorithm The expectation step With initial guesses for the parameters in our mixture model, we compute "partial membership" of each data point in each constituent distribution. This is done by calculating expectation values for the membership variables of each data point.

The ‘Expectation-Maximization’ Algorithm The maximization step With the expectation values in hand for group membership, we can recompute plug-in estimates of our distribution parameters. For the mixing coefficient a_2 this is simply the fractional membership of all data points in the second distribution.

EM-algorithm for Clustering Suppose we have data D, a model with parameters θ, and hidden parameters H. Interpretation: H = the class label. Log-likelihood of the observed data:

l(θ) = log p(D | θ) = log Σ_H p(D, H | θ)

EM-algorithm for Clustering With p the probability over the data D, let Q be the unknown distribution over the hidden parameters H. Then the log-likelihood is:

l(θ) = log Σ_H Q(H) p(D, H | θ) / Q(H) ≥ Σ_H Q(H) log( p(D, H | θ) / Q(H) ) = F(Q, θ)

[*Jensen’s inequality]

Jensen’s inequality: for a concave (concave-down) function, the expected value of the function is at most the function of the expected value, E[f(x)] ≤ f(E[x]). (Figure: the gray rectangle along the horizontal axis represents the probability distribution of x, which is uniform for simplicity, but the general idea applies to any distribution.)

EM-algorithm So: F(Q, θ) is a lower bound on the log-likelihood function l(θ). EM alternates between:
E-step: maximising F with respect to Q with θ fixed, and
M-step: maximising F with respect to θ with Q fixed.

EM-algorithm
E-step: Q^(t+1) = argmax_Q F(Q, θ^(t))
M-step: θ^(t+1) = argmax_θ F(Q^(t+1), θ)

Probabilistic Model-Based Clustering using Gaussian Mixtures

Probabilistic Model-Based Clustering using Mixture Models

Gaussian Mixture Decomposition Gaussian mixture decomposition is a good classifier. It allows supervised as well as unsupervised learning (finding how many classes are optimal, how they should be defined, ...). But training is iterative and time-consuming. The idea is to set the position and width of the Gaussian distribution(s) so as to optimize the coverage of the training samples.
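A compact MATLAB/Octave sketch of EM for a one-dimensional Gaussian mixture, illustrating the E- and M-steps described above; this is a generic sketch, not the lecture's code, and the initialization and fixed iteration count are assumptions:

% save as gmm_em_sketch.m
function [a, mu, s2] = gmm_em_sketch(x, K, nIter)
    % x: N x 1 data vector, K: number of Gaussian components, nIter: EM iterations
    x  = x(:);
    N  = numel(x);
    a  = ones(1, K) / K;                    % mixing proportions a_k
    mu = reshape(x(randperm(N, K)), 1, K);  % means initialized at random data points
    s2 = var(x) * ones(1, K);               % component variances
    for it = 1:nIter
        % E-step: partial memberships ("responsibilities") u(i,k)
        u = zeros(N, K);
        for k = 1:K
            u(:, k) = a(k) * exp(-(x - mu(k)).^2 / (2*s2(k))) / sqrt(2*pi*s2(k));
        end
        u = bsxfun(@rdivide, u, sum(u, 2));
        % M-step: plug-in re-estimates from the memberships
        nk = sum(u, 1);                     % effective number of points per component
        a  = nk / N;                        % fractional memberships
        mu = (u' * x)' ./ nk;               % weighted means
        for k = 1:K
            s2(k) = sum(u(:, k) .* (x - mu(k)).^2) / nk(k) + 1e-6;  % floor avoids collapse
        end
    end
end

Usage, e.g.: [a, mu, s2] = gmm_em_sketch([randn(300,1); 4 + randn(300,1)], 2, 100).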

Probabilistic Model-Based Clustering using Mixture Models

The End