Applied Human Computer Interaction Week 5 Yan Ke


Clustering Applied Human Computer Interaction Week 5 Yan Ke

Basic Clustering Methods: K-means, Expectation Maximization (EM)

Some Data This could easily be modeled by a Gaussian Mixture (with 5 components) But let’s look at a satisfying, friendly and infinitely popular alternative… Copyright © 2001, 2004, Andrew W. Moore

Suppose you transmit the coordinates of points drawn randomly from this dataset. You can install decoding software at the receiver. You’re only allowed to send two bits per point. It’ll have to be a “lossy transmission”. Loss = Sum Squared Error between decoded coords and original coords. What encoder/decoder will lose the least information? Lossy Compression Copyright © 2001, 2004, Andrew W. Moore

Idea One Break into a grid and decode each bit-pair (00, 01, 10, 11) as the middle of its grid cell. Any Better Ideas? Copyright © 2001, 2004, Andrew W. Moore

Idea Two Break into a grid and decode each bit-pair (00, 01, 10, 11) as the centroid of all the data in that grid cell. Any Further Ideas? Copyright © 2001, 2004, Andrew W. Moore
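A minimal sketch comparing the two decoders on synthetic data (the data, variable names and 2x2 grid below are illustrative assumptions, not Moore's code):

% Compare the two 2-bit encoder/decoder ideas on synthetic points in the unit square.
X = rand(1000, 2);                           % assumed data
code = (X(:,1) > 0.5) + 2*(X(:,2) > 0.5);    % 2-bit code: which of the four grid cells
% Idea One: decode each bit-pair as the middle of its grid cell
mid1 = [0.25 0.25; 0.75 0.25; 0.25 0.75; 0.75 0.75];
% Idea Two: decode each bit-pair as the centroid of the data in that cell
cen = zeros(4, 2);
for c = 0:3
    cen(c+1, :) = mean(X(code == c, :), 1);
end
sse1 = sum(sum((X - mid1(code+1, :)).^2));   % loss for Idea One
sse2 = sum(sum((X - cen(code+1, :)).^2));    % loss for Idea Two (never larger than sse1)

Idea Two is essentially the K-means update step applied to fixed grid assignments; the next slides drop the fixed grid and let the assignments adapt.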

K-means Ask user how many clusters they’d like. (e.g. k=5) Copyright © 2001, 2004, Andrew W. Moore

K-means Ask user how many clusters they’d like. (e.g. k=5) Randomly guess k cluster Center locations Copyright © 2001, 2004, Andrew W. Moore

K-means Ask user how many clusters they’d like. (e.g. k=5) Randomly guess k cluster Center locations Each datapoint finds out which Center it’s closest to. (Thus each Center “owns” a set of datapoints) Copyright © 2001, 2004, Andrew W. Moore

K-means Ask user how many clusters they’d like. (e.g. k=5) Randomly guess k cluster Center locations Each datapoint finds out which Center it’s closest to. Each Center finds the centroid of the points it owns Copyright © 2001, 2004, Andrew W. Moore

K-means Ask user how many clusters they’d like. (e.g. k=5) Randomly guess k cluster Center locations Each datapoint finds out which Center it’s closest to. Each Center finds the centroid of the points it owns… …and jumps there …Repeat until terminated! Copyright © 2001, 2004, Andrew W. Moore
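A minimal K-means sketch following exactly these steps (illustrative code with assumed data; pdist2 is from the Statistics Toolbox):

% K-means: assignment and update repeated until the centers stop moving.
X = randn(500, 2);                          % assumed data: 500 points in 2-D
k = 5;                                      % number of clusters the user asked for
C = X(randperm(size(X, 1), k), :);          % randomly guess k cluster Center locations
for iter = 1:100
    [~, idx] = min(pdist2(X, C), [], 2);    % each datapoint finds its closest Center
    Cold = C;
    for j = 1:k                             % each Center finds the centroid of the points it owns...
        if any(idx == j)
            C(j, :) = mean(X(idx == j, :), 1);   % ...and jumps there
        end
    end
    if isequal(C, Cold), break; end         % repeat until terminated (no change)
end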

K-means Start. Advance apologies: in black and white this example will deteriorate. Example generated by Dan Pelleg's super-duper fast K-means system: Dan Pelleg and Andrew Moore, "Accelerating Exact k-means Algorithms with Geometric Reasoning", Proc. Conference on Knowledge Discovery in Databases (KDD99), 1999 (available on www.autonlab.org/pap.html). Copyright © 2001, 2004, Andrew W. Moore

K-means continues… (the next several slides show successive iterations of the assignment and update steps) Copyright © 2001, 2004, Andrew W. Moore

K-means terminates Copyright © 2001, 2004, Andrew W. Moore

K-means Questions What is it trying to optimize? Are we sure it will terminate? Are we sure it will find an optimal clustering? How should we start it? How could we automatically choose the number of centers? ….we’ll deal with these questions over the next few slides Copyright © 2001, 2004, Andrew W. Moore

Will we find the optimal configuration? Not necessarily. Can you invent a configuration that has converged, but does not have the minimum distortion? (Hint: try a fiendish k=3 configuration here…) Copyright © 2001, 2004, Andrew W. Moore

Trying to find good optima Idea 1: Be careful about where you start Idea 2: Do many runs of k-means, each from a different random start configuration Many other ideas floating around. Copyright © 2001, 2004, Andrew W. Moore
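With MATLAB's built-in kmeans (Statistics Toolbox), Idea 2 corresponds to the 'Replicates' option; a hedged one-liner, assuming the data sit in a matrix X:

% Run k-means from 10 different random starting configurations and keep the lowest-distortion run.
[idx, C, sumd] = kmeans(X, 5, 'Replicates', 10, 'Start', 'sample');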

Common uses of K-means Often used as an exploratory data analysis tool In one dimension, a good way to quantize real-valued variables into k non-uniform buckets Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization) Also used for choosing color palettes on old-fashioned graphical display devices! Copyright © 2001, 2004, Andrew W. Moore

Single Linkage Hierarchical Clustering Say “Every point is its own cluster” Copyright © 2001, 2004, Andrew W. Moore

Single Linkage Hierarchical Clustering Say “Every point is its own cluster” Find “most similar” pair of clusters Copyright © 2001, 2004, Andrew W. Moore

Single Linkage Hierarchical Clustering Say “Every point is its own cluster” Find “most similar” pair of clusters Merge it into a parent cluster Copyright © 2001, 2004, Andrew W. Moore

Single Linkage Hierarchical Clustering Say “Every point is its own cluster” Find “most similar” pair of clusters Merge it into a parent cluster Repeat Copyright © 2001, 2004, Andrew W. Moore

Single Linkage Hierarchical Clustering How do we define similarity between clusters? Minimum distance between points in clusters (in which case we’re simply doing Euclidean Minimum Spanning Trees) Maximum distance between points in clusters Average distance between points in clusters Say “Every point is its own cluster” Find “most similar” pair of clusters Merge it into a parent cluster Repeat…until you’ve merged the whole dataset into one cluster You’re left with a nice dendrogram, or taxonomy, or hierarchy of datapoints (not shown here) Copyright © 2001, 2004, Andrew W. Moore

Single Linkage Comments Also known in the trade as Hierarchical Agglomerative Clustering (note the acronym) It’s nice that you get a hierarchy instead of an amorphous collection of groups If you want k groups, just cut the (k-1) longest links There’s no real statistical or information-theoretic foundation to this. Makes your lecturer feel a bit queasy. Copyright © 2001, 2004, Andrew W. Moore
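A hedged MATLAB sketch of the procedure described above (Statistics Toolbox functions; X and k are placeholders):

% Build the single-linkage hierarchy, draw the dendrogram, then cut it into k groups.
Z = linkage(X, 'single');          % 'single' = minimum distance between clusters
dendrogram(Z);                     % the dendrogram / taxonomy of datapoints
k = 3;
T = cluster(Z, 'maxclust', k);     % k groups, i.e. cut the (k-1) longest links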

What you should know All the details of K-means The theory behind K-means as an optimization algorithm How K-means can get stuck The outline of Hierarchical clustering Be able to contrast between which problems would be relatively well/poorly suited to K-means vs Gaussian Mixtures vs Hierarchical clustering Copyright © 2001, 2004, Andrew W. Moore

EM Algorithm Likelihood, Mixture Models and Clustering

Introduction In the last class the K-means algorithm for clustering was introduced. The two steps of K-means, assignment and update, appear frequently in data mining tasks. In fact a whole framework under the title “EM Algorithm”, where EM stands for Expectation-Maximization, is now a standard part of the data mining toolkit.

Outline What is Likelihood? Examples of Likelihood Estimation Information Theory: Jensen's Inequality The EM Algorithm and its Derivation Examples of Mixture Estimation Clustering as a special case of Mixture Modeling

Meta-Idea A model of the data-generating process gives rise to data; model estimation from data is most commonly done through likelihood estimation. (Slide diagram: Probability Model, Data, Inference (Likelihood).) From PDM by HMS

Likelihood Function Find the “best” model which could have generated the data. In a likelihood function the data is considered fixed and one searches for the best model over the different choices available.

Model Space The choice of model space is wide but not unlimited, and there is a bit of “art” in selecting the appropriate one. Typically the model space is assumed to be a mixture (a weighted combination) of known probability distribution functions.

Examples Suppose we have the following data: 0,1,1,0,0,1,1,0. In this case it is sensible to choose the Bernoulli distribution B(p) as the model space. Now we want to choose the best p, i.e., the p that maximizes the likelihood L(p) = \prod_i p^{x_i} (1-p)^{1-x_i}.

Examples Suppose the following are marks in a course: 55.5, 67, 87, 48, 63. Marks typically follow a Normal distribution, whose density function is f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2 / (2\sigma^2)}. Now we want to find the best \mu and \sigma such that the likelihood of the observed marks is maximized.
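A minimal check of the Normal MLE for these marks (illustrative; note the maximum-likelihood variance divides by n, not n-1):

marks = [55.5 67 87 48 63];
mu_hat    = mean(marks);           % MLE of the mean
sigma_hat = std(marks, 1);         % MLE of the standard deviation (std(.,1) normalizes by n)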

Examples Suppose we have data about heights of people (in cm): 185, 140, 134, 150, 170. Heights follow a normal (or log-normal) distribution, but men on average are taller than women. This suggests a mixture of two distributions, e.g. f(x) = \pi\, N(x; \mu_1, \sigma_1^2) + (1-\pi)\, N(x; \mu_2, \sigma_2^2).

Maximum Likelihood Estimation We have reduced the problem of selecting the best model to that of selecting the best parameter. We want to select the parameter p which maximizes the probability that the data was generated from the model with that p plugged in. The maximizing value of p is called the maximum likelihood estimate. The maximum of the likelihood function can be obtained by setting its derivative to zero and solving for p.

Two Important Facts If A1, …, An are independent, then P(A1, …, An) = P(A1) P(A2) ⋯ P(An). The log function is monotonically increasing: x ≤ y implies log(x) ≤ log(y). Therefore if a function f(x) ≥ 0 achieves a maximum at x1, then log(f(x)) also achieves its maximum at x1.

Example of MLE Now, choose the p which maximizes L(p). Instead we will maximize l(p) = log L(p), which has the same maximizer because log is monotonically increasing.
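A worked version of this maximization for the Bernoulli example (the standard derivation, reconstructed rather than copied from the slide):

l(p) = \sum_{i=1}^{n} \left[ x_i \log p + (1 - x_i) \log(1 - p) \right]

\frac{dl}{dp} = \frac{\sum_i x_i}{p} - \frac{n - \sum_i x_i}{1 - p} = 0
\quad\Longrightarrow\quad
\hat{p} = \frac{1}{n} \sum_{i=1}^{n} x_i

For the data 0,1,1,0,0,1,1,0 this gives \hat{p} = 4/8 = 0.5.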

Properties of MLE There are several technical properties of the estimator, but let's look at the most intuitive one: as the number of data points increases, we become more sure about the parameter p.

Properties of MLE Here r is the number of data points. As the number of data points increases, the confidence in the estimator increases.

Matlab commands [phat,ci] = mle(Data,'distribution','Bernoulli'); [phi,ci] = mle(Data,'distribution','Normal');
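A hedged usage sketch of these commands with the earlier example data (Data and Marks are placeholder names; mle is a Statistics Toolbox function):

Data = [0 1 1 0 0 1 1 0];                              % the Bernoulli example
[phat, ci] = mle(Data, 'distribution', 'Bernoulli');   % phat should come out as 0.5
Marks = [55.5 67 87 48 63];                            % the marks example
[phi, ci]  = mle(Marks, 'distribution', 'Normal');     % phi holds the estimated mean and standard deviation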

MLE for Mixture Distributions When we proceed to calculate the MLE for a mixture, the sum of distributions inside the logarithm prevents a “neat” factorization using the log function. A completely different approach is required to estimate the parameters. That approach also provides a solution to the clustering problem.

A Mixture Distribution

Missing Data We think of clustering as a problem of estimating missing data. The missing data are the cluster labels. Clustering is only one example of a missing data problem. Several other problems can be formulated as missing data problems.

Missing Data Problem Let D = {x(1), x(2), …, x(n)} be a set of n observations. Let H = {z(1), z(2), …, z(n)} be a set of n values of a hidden variable Z, where z(i) corresponds to x(i). Assume Z is discrete.

EM Algorithm The log-likelihood of the observed data is l(θ) = log P(D | θ) = log Σ_H P(D, H | θ). Not only do we have to estimate θ, but also H. Let Q(H) be a probability distribution over the missing data.

EM Algorithm The inequality follows from Jensen's inequality, and it means that F(Q, θ) is a lower bound on l(θ). Notice that the log of a sum has become a sum of logs.
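The bound itself, written out in the usual notation (this is the standard Jensen-based derivation, not a verbatim copy of the slide):

l(\theta) = \log \sum_{H} P(D, H \mid \theta)
          = \log \sum_{H} Q(H) \frac{P(D, H \mid \theta)}{Q(H)}
          \;\ge\; \sum_{H} Q(H) \log \frac{P(D, H \mid \theta)}{Q(H)}
          \;=\; F(Q, \theta)

The inequality uses Jensen's inequality with the concave function log and the weights Q(H).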

EM Algorithm The EM algorithm alternates between maximizing F with respect to Q (with θ fixed) and maximizing F with respect to θ (with Q fixed).

EM Algorithm It turns out that the E-step is just setting Q(H) = P(H | D, θ), the posterior distribution of the missing data given the current parameters. And, furthermore, with this choice the bound is tight, i.e. F(Q, θ) = l(θ) at the current θ: just plug it in.

EM Algorithm The M-step reduces to maximizing the first term of F (the expected complete-data log-likelihood, E_Q[log P(D, H | θ)]) with respect to θ, as there is no θ in the second term (the entropy of Q).

EM Algorithm for Mixture of Normals (the slide shows the E-step and M-step update equations)
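A minimal EM sketch for a two-component 1-D Gaussian mixture, applied to the heights example from earlier (initial values and variable names are assumptions; normpdf needs the Statistics Toolbox, and with only five points this is purely illustrative):

% E-step: compute responsibilities; M-step: re-estimate weights, means and standard deviations.
h  = [185 140 134 150 170]';                  % heights from the earlier example
mu = [140 180]; sg = [10 10]; w = [0.5 0.5];  % assumed initial parameters
for iter = 1:200
    % E-step: r(i,j) = P(component j | h(i), current parameters)
    r = [w(1)*normpdf(h, mu(1), sg(1)), w(2)*normpdf(h, mu(2), sg(2))];
    r = r ./ sum(r, 2);
    % M-step: maximize the expected complete-data log-likelihood
    n = sum(r, 1);
    w = n / numel(h);
    for j = 1:2
        mu(j) = sum(r(:,j) .* h) / n(j);
        sg(j) = sqrt(sum(r(:,j) .* (h - mu(j)).^2) / n(j));
    end
end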

EM and K-means Notice the similarity between EM for Normal mixtures and K-means. The expectation step is the assignment. The maximization step is the update of centers.

Utilizing Clustering Methods for Forecasting

Pattern Sequence Forecasting (PSF) PSF is a forecasting method that combines traditional regression methods, such as moving average (MA) and auto-regression (AR), with clustering methods such as K-means and EM. PSF can achieve highly accurate results in electricity consumption forecasting.

Electricity Power Consumption Forecasting

Daily Power Consumption Variation

Pattern Sequence Forecasting (PSF) Step 1: Pre-process the data by re-organizing the table so that each day is one row containing its hourly values (1st hour, 2nd hour, …) together with a cluster class label: day1 → Class1, day2 → Class2, day3 → Class3.
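A hedged sketch of this pre-processing step (the synthetic load series, the variable names and the choice of 3 clusters are assumptions, not taken from the slides):

% Step 1: one row per day holding its 24 hourly values, then a cluster class label per day.
hourly = rand(24*365, 1);            % placeholder: one year of hourly consumption
days   = reshape(hourly, 24, [])';   % 365-by-24: row d holds day d's 1st ... 24th hour values
k      = 3;
labels = kmeans(days, k);            % Class1 / Class2 / Class3 label for each day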

Selecting the number of Clusters

Distribution of Weekdays for Clusters

Misclassification by K-means

Means of Daily Electricity Consumption

Pattern Sequence Forecasting (PSF)

Predictions