Unsupervised Learning: Clustering. Some material adapted from slides by Andrew Moore, CMU; see Andrew's repository of Data Mining tutorials.

Unsupervised Learning. Supervised learning used labeled data pairs (x, y) to learn a function f : X→Y. But what if we don't have labels? No labels = unsupervised learning. Only some points labeled = semi-supervised learning: labels may be expensive to obtain, so we only get a few. Clustering is the unsupervised grouping of data points. It can be used for knowledge discovery.

Clustering Data

K-Means Clustering. K-Means(k, data): Randomly choose k cluster center locations (centroids). Loop until convergence: assign each point to the cluster of the closest centroid, then re-estimate the cluster centroids based on the data assigned to each.
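A minimal sketch of this loop in Python, assuming the data is an (n, d) NumPy array; the iteration cap, the convergence test, and the handling of empty clusters are illustrative choices rather than anything specified on the slide:

```python
import numpy as np

def k_means(data, k, max_iters=100, rng=None):
    """Plain k-means: random initial centroids, then assign / re-estimate until stable."""
    rng = np.random.default_rng(rng)
    # Randomly choose k cluster center locations (centroids) from the data points.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign each point to the cluster of the closest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-estimate each centroid from the points assigned to it
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return centroids, labels
```

Calling k_means(points, 5) on such an array returns the final centroids and each point's cluster label.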

K-Means Animation. Example generated by Andrew Moore using Dan Pelleg's super-duper fast K-means system: Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases, 1999.

Problems with K-Means. Very sensitive to the initial points: do many runs of k-means, each with different initial centroids, or seed the centroids using a better method than random (e.g., farthest-first sampling). Must manually choose k: learn the optimal k for the clustering (note that this requires a performance measure).
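Both fixes are a few lines on top of the k_means sketch above; the version below is a hedged illustration, with within-cluster squared error standing in for the unnamed performance measure:

```python
import numpy as np

def farthest_first_centroids(data, k, rng=None):
    """Seed centroids greedily: each new centroid is the point farthest from those chosen so far."""
    rng = np.random.default_rng(rng)
    centroids = [data[rng.integers(len(data))]]
    while len(centroids) < k:
        # Distance from every point to its nearest already-chosen centroid.
        dists = np.min([np.linalg.norm(data - c, axis=1) for c in centroids], axis=0)
        centroids.append(data[dists.argmax()])
    return np.array(centroids)

def best_of_restarts(data, k, runs=10):
    """Run k-means several times with different initial centroids and keep the lowest-SSE result."""
    best = None
    for seed in range(runs):
        centroids, labels = k_means(data, k, rng=seed)
        sse = ((data - centroids[labels]) ** 2).sum()   # within-cluster squared error
        if best is None or sse < best[0]:
            best = (sse, centroids, labels)
    return best[1], best[2]
```

Choosing k can be handled similarly: run the procedure for several candidate values of k and compare them with a performance measure that accounts for the number of clusters.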

Problems with K-Means. How do you tell it which clustering you want? Constrained clustering techniques: same-cluster constraints (must-link) and different-cluster constraints (cannot-link).
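A hedged sketch of how such constraints might be represented and checked; the pair lists of point indices are an assumed encoding, not something specified on the slide:

```python
def violates_constraints(labels, must_link=(), cannot_link=()):
    """Return the first violated pairwise constraint as (kind, i, j), or None if all hold."""
    for i, j in must_link:
        if labels[i] != labels[j]:      # same-cluster constraint broken
            return ("must-link", i, j)
    for i, j in cannot_link:
        if labels[i] == labels[j]:      # different-cluster constraint broken
            return ("cannot-link", i, j)
    return None
```

Constrained variants of k-means (e.g., COP-KMeans) apply this kind of check inside the assignment step, refusing to place a point in a cluster that would violate a constraint.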

Learning Bayes Nets. Some material adapted from lecture notes by Lise Getoor and Ron Parr; adapted from slides by Tim Finin and Marie desJardins.

Learning Bayesian networks. Given a training set D, find the network B that best matches D: this involves model selection and parameter estimation. [Diagram: the data D is fed to an inducer, which outputs a network over nodes A, B, C, E.]

Parameter estimation. Assume a known structure. Goal: estimate the BN parameters θ, the entries in the local probability models P(X | Parents(X)). A parameterization θ is good if it is likely to generate the observed data. Maximum Likelihood Estimation (MLE) principle: choose θ* so as to maximize the likelihood L of the i.i.d. samples.
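In standard notation, with D consisting of i.i.d. samples x[1], ..., x[M], the likelihood being maximized is

    L(\Theta : D) = P(D \mid \Theta) = \prod_{m=1}^{M} P(x[m] \mid \Theta).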

Parameter estimation II. The likelihood decomposes according to the structure of the network, so we get a separate estimation task for each parameter. The MLE (maximum likelihood estimate) solution: for each value x of a node X and each instantiation u of Parents(X), we just need to collect the counts for every combination of parents and children observed in the data; these counts are the sufficient statistics. MLE is equivalent to an assumption of a uniform prior over parameter values.
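Concretely, writing \Theta_i for the parameters of X_i's local model, the decomposition and the resulting count-based estimates are

    L(\Theta : D) = \prod_i \prod_m P(x_i[m] \mid \mathrm{pa}_i[m], \Theta_i), \qquad \hat{\theta}_{x \mid u} = \frac{N(x, u)}{N(u)},

where N(x, u) is the number of training cases in which X = x and Parents(X) = u.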

Sufficient statistics: Example. Why are the counts sufficient? [Diagram: network with nodes Earthquake, Burglary, Alarm, Moon-phase, Light-level.] θ*_{A | E, B} = N(A, E, B) / N(E, B).
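As a worked illustration with hypothetical counts (they are not from any dataset on the slides): if the training data contain N(E{=}0, B{=}1) = 10 cases and the alarm sounded in 9 of them, then

    \hat{\theta}_{A=1 \mid E=0, B=1} = \frac{N(A{=}1, E{=}0, B{=}1)}{N(E{=}0, B{=}1)} = \frac{9}{10} = 0.9,

and nothing else about those 10 cases matters for this parameter, which is exactly what makes the counts sufficient.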

Model selection. Goal: select the best network structure, given the data. Input: training data and a scoring function. Output: a network that maximizes the score.

Structure selection: Scoring. Bayesian: place a prior over parameters and structure; a balance between model complexity and fit to the data comes out as a byproduct. Score(G : D) = log P(G | D) ∝ log [P(D | G) P(G)], where P(D | G) is the marginal likelihood and P(G) is the structure prior. The marginal likelihood comes from our parameter estimates; the prior on structure can be any measure we want, typically a function of the network complexity. The same key property holds, decomposability: Score(structure) = Σ_i Score(family of X_i).
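A minimal sketch of one decomposable, count-based score in Python; the log-likelihood term follows from the MLE estimates above, while the data format (a list of dicts mapping variable names to values) and the BIC-style complexity penalty are illustrative assumptions, not the slides' specific scoring function:

```python
import math
from collections import Counter

def family_score(node, parents, data):
    """Log-likelihood of one family under its MLE parameters, minus a BIC-style penalty."""
    parents = sorted(parents)
    joint, marg, values = Counter(), Counter(), set()
    for record in data:                     # collect the sufficient statistics
        u = tuple(record[p] for p in parents)
        x = record[node]
        joint[(x, u)] += 1
        marg[u] += 1
        values.add(x)
    loglik = sum(n * math.log(n / marg[u]) for (x, u), n in joint.items())
    # (|Val(X)| - 1) free parameters per observed parent instantiation.
    n_params = (len(values) - 1) * max(len(marg), 1)
    return loglik - 0.5 * n_params * math.log(len(data))

def structure_score(parents_of, data):
    """Decomposability: the score of a structure is the sum of its family scores."""
    return sum(family_score(x, ps, data) for x, ps in parents_of.items())
```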

Heuristic search. [Diagram: four candidate structures over B, E, A, C, connected by single-edge moves: Add E → C (Δscore(C)), Delete E → A (Δscore(A)), Reverse E → A (Δscore(A)).]

Exploiting decomposability. [Diagram: the same neighborhood of structures, with each move annotated by the Δscore of the family it affects.] To recompute scores, we only need to re-score the families that changed in the last move.
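A hedged sketch of the resulting greedy search; it reuses the family_score sketched above, and the move set, acyclicity test, and improvement threshold are illustrative choices. A structure is a dict mapping each node to the set of its parents:

```python
def is_acyclic(parents_of):
    """Kahn-style check: keep removing nodes whose remaining parent set is empty."""
    remaining = {n: set(ps) for n, ps in parents_of.items()}
    while remaining:
        roots = [n for n, ps in remaining.items() if not ps]
        if not roots:
            return False                    # what is left over contains a cycle
        for r in roots:
            del remaining[r]
        for ps in remaining.values():
            ps.difference_update(roots)
    return True

def single_edge_moves(parents_of):
    """Yield (move name, candidate structure, families whose local score changed)."""
    copy = lambda: {n: set(ps) for n, ps in parents_of.items()}
    for x in parents_of:
        for y in parents_of:
            if x == y:
                continue
            if x in parents_of[y]:
                delete = copy(); delete[y].discard(x)
                yield f"delete {x}->{y}", delete, {y}
                reverse = copy(); reverse[y].discard(x); reverse[x].add(y)
                yield f"reverse {x}->{y}", reverse, {x, y}
            else:
                add = copy(); add[y].add(x)
                yield f"add {x}->{y}", add, {y}

def hill_climb(nodes, data):
    parents_of = {n: set() for n in nodes}  # start from the empty graph
    fam = {n: family_score(n, parents_of[n], data) for n in nodes}
    while True:
        best = None
        for move, candidate, changed in single_edge_moves(parents_of):
            if not is_acyclic(candidate):
                continue
            # Decomposability: only the families touched by the move get re-scored.
            delta = sum(family_score(n, candidate[n], data) - fam[n] for n in changed)
            if delta > 1e-9 and (best is None or delta > best[0]):
                best = (delta, move, candidate, changed)
        if best is None:
            return parents_of               # local optimum: no improving single-edge move
        _, move, parents_of, changed = best
        for n in changed:
            fam[n] = family_score(n, parents_of[n], data)
```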

Variations on a theme. Known structure, fully observable: only need to do parameter estimation. Unknown structure, fully observable: do heuristic search through structure space, then parameter estimation. Known structure, missing values: use expectation maximization (EM) to estimate the parameters. Known structure, hidden variables: apply adaptive probabilistic network (APN) techniques. Unknown structure, hidden variables: too hard to solve!

Handling missing data. Suppose that in some cases we observe earthquake, alarm, light-level, and moon-phase, but not burglary. Should we throw that data away? Idea: guess the missing values based on the other data. [Diagram: network with nodes Earthquake, Burglary, Alarm, Moon-phase, Light-level.]

EM (expectation maximization). Guess probabilities for the nodes with missing values (e.g., based on the other observations). Compute the probability distribution over the missing values, given our guess. Update the probabilities based on the guessed values. Repeat until convergence.
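A minimal sketch of that loop for the burglary network, where B is the only variable that can be missing; the record format, the binary variables, the 0.5 initial guesses, and the fixed iteration count are all illustrative assumptions (P(E) is omitted because E is fully observed and can be estimated by direct counting):

```python
def em_burglary(records, n_iters=20):
    """EM for P(B=1) and P(A=1 | B, E) when 'B' is None in some records."""
    p_b = 0.5                                              # initial guess for P(B=1)
    p_a = {(b, e): 0.5 for b in (0, 1) for e in (0, 1)}    # initial guesses for P(A=1 | B=b, E=e)
    for _ in range(n_iters):
        # E-step: expected weight of B=1 for each record, given the current parameters.
        weights = []
        for r in records:
            if r['B'] is not None:
                w = float(r['B'])                          # B observed: no guessing needed
            else:
                like = {}
                for b in (0, 1):
                    prior = p_b if b == 1 else 1 - p_b
                    pa = p_a[(b, r['E'])]
                    like[b] = prior * (pa if r['A'] == 1 else 1 - pa)
                w = like[1] / (like[0] + like[1])          # P(B=1 | E, A) under the current guess
            weights.append(w)
        # M-step: re-estimate the CPTs from the expected counts.
        p_b = sum(weights) / len(records)
        for b in (0, 1):
            for e in (0, 1):
                num = den = 0.0
                for r, w in zip(records, weights):
                    if r['E'] != e:
                        continue
                    wb = w if b == 1 else 1 - w            # expected count of B=b in this record
                    den += wb
                    num += wb * r['A']
                if den > 0:
                    p_a[(b, e)] = num / den
    return p_b, p_a
```

Each pass is exactly the loop on the slide: guess, compute the distribution over the missing values, update the probabilities, repeat.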

EM example. Suppose we have observed Earthquake and Alarm but not Burglary for an observation on November 27. We estimate the CPTs based on the rest of the data. We then estimate P(Burglary) for November 27 from those CPTs. Now we recompute the CPTs as if that estimated value had been observed. Repeat until convergence! [Diagram: Earthquake → Alarm ← Burglary.]