Unsupervised, Cont’d Expectation Maximization
Presentation tips Practice! Work on knowing what you’re going to say at each point. Know your own presentation. Practice! Work on timing: you have 15 minutes to talk + 3 minutes for questions. Will be graded on adherence to time! Timing is hard, but it becomes easier as you practice.
Presentation tips Practice! What appears on your screen is different from what will appear when projected: different size, different font, different line thicknesses, different color. Avoid hard-to-distinguish colors (red on blue). Don’t rely completely on color for visual distinctions.
The final report Due: Dec 17, 5:00 PM (last day of finals week) Should contain: Intro: what was your problem; why should we care about it? Background: what have other people done? Your work: what did you do? Was it novel or re-implementation? (Algorithms, descriptions, etc.) Results: Did it work? How do we know? (Experiments, plots & tables, etc.) Discussion: What did you/we learn from this? Future work: What would you do next/do over? Length: Long enough to convey all that
The final report Will be graded on: Content: Have you accomplished what you set out to? Have you demonstrated your conclusions? Have you described what you did well? Analysis: have you thought clearly about what you accomplished, drawn appropriate conclusions, formulated appropriate “future work”, etc? Writing and clarity: Have you conveyed your ideas clearly and concisely? Are all of your conclusions supported by arguments? Are your algorithms/data/etc. described clearly?
Back to clustering Purpose of clustering: Find “chunks” of “closely related” data Uses notion of similarity among points Often, distance is interpreted as similarity Agglomerative: Start w/ individuals==clusters; join together clusters There’s also divisive: Start w/ all data==one cluster; split apart clusters
Combinatorial clustering General clustering framework: Set target of k clusters Choose a cluster optimality criterion Often function of “between-cluster variation” vs. “within-cluster variation” Find assignment of points to clusters that minimizes (maximizes) this criterion Q: Given N data points and k clusters, how many possible clusterings are there?
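A rough answer to the question above, assuming clusters are unlabeled and every cluster must be non-empty (an assumption about what the question intends): the count is the Stirling number of the second kind, S(N, k). The function name `stirling2` is hypothetical; the sketch uses the standard recurrence S(n, k) = k·S(n-1, k) + S(n-1, k-1).

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Ways to split n labeled points into k non-empty, unlabeled clusters."""
    if n == k:
        return 1  # every point in its own cluster (also covers n == k == 0)
    if k == 0 or k > n:
        return 0  # no valid partition exists
    # Point n either joins one of the k existing clusters,
    # or forms a new cluster on its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


if __name__ == "__main__":
    print(stirling2(10, 3))  # 9330
```

Even 10 points in 3 clusters give 9330 clusterings, and the count grows roughly like k^N / k!, which is why exhaustive search over assignments is not practical.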
Example clustering criteria Define: Cluster i : Cluster i mean: Between-cluster variation: Within-cluster variation:
Example clustering criteria Now want some way to trade off within vs. between. Usually want to decrease within-cluster variation but increase between-cluster variation. E.g., maximize: or: α > 0 controls relative importance of terms
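A minimal sketch of these quantities, assuming the common sum-of-squares definitions (the slide's own formulas are not reproduced here, so these are assumptions): within-cluster variation W is the summed squared distance of each point to its cluster mean, and between-cluster variation B is the size-weighted squared distance of each cluster mean to the overall mean. The two example criteria, B / W and B - α·W, are likewise assumed forms, and all function names are hypothetical.

```python
def mean(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_variation(points, labels):
    """Return (within, between) sum-of-squares variation for a clustering."""
    grand = mean(points)
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    within = 0.0
    between = 0.0
    for members in clusters.values():
        mu = mean(members)
        within += sum(sq_dist(p, mu) for p in members)
        between += len(members) * sq_dist(mu, grand)
    return within, between


if __name__ == "__main__":
    pts = [(0.0,), (2.0,), (10.0,), (12.0,)]
    w, b = cluster_variation(pts, [0, 0, 1, 1])
    alpha = 1.0  # assumed trade-off weight
    print(w, b, b / w, b - alpha * w)
```

With these definitions, W + B equals the total variation around the overall mean, so for a fixed data set decreasing W and increasing B are the same thing.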
Comb. clustering example Clustering of seismological data
Unsup. prob. modeling Sometimes, instead of clusters, want a full probability model of the data. Can sometimes use the prob. model to get clusters. Recall: in supervised learning, we said: find a probability model, Pr[X|C_i], for each class C_i. Now: find a prob. model for the data w/o knowing the class: Pr[X]. Simplest: fit your favorite model via maximum likelihood (ML). Harder: assume a “hidden cluster ID” variable.
Hidden variables Assume data is generated by k different underlying processes/models E.g., k different clusters, k classes, etc. BUT, you don’t get to “see” which point was generated by which process Only get the X for each point; the y is hidden Want to build complete data model from k different “cluster specific” models:
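In the usual notation, and assuming α_i denotes the weight of sub-model i and Θ_i its parameters (the symbol Θ is an assumption), the combined model can be written as:

```latex
\[
\Pr[X \mid \Theta] \;=\; \sum_{i=1}^{k} \alpha_i \, \Pr[X \mid \Theta_i],
\qquad \alpha_i \ge 0, \quad \sum_{i=1}^{k} \alpha_i = 1 .
\]
```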
Mixture models This form is called a “mixture model”: a “mixture” of k sub-models. Equivalent to the process: roll a weighted die (weighted by α_i); choose the corresponding sub-model; generate a data point from that sub-model. Example: mixture of Gaussians:
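The generative process can be sketched directly for a 1-d mixture of Gaussians. The function name `sample_mixture` and its parameters are hypothetical; `alphas` are the die weights, and `mus` and `sigmas` are the per-component means and standard deviations.

```python
import random


def sample_mixture(alphas, mus, sigmas, n, seed=None):
    """Draw n points from a 1-d Gaussian mixture.

    Returns a list of (x, i) pairs, where i is the component that
    generated x. In the unsupervised setting only x would be observed.
    """
    rng = random.Random(seed)
    components = list(range(len(alphas)))
    samples = []
    for _ in range(n):
        # Roll the weighted die to choose a sub-model...
        i = rng.choices(components, weights=alphas, k=1)[0]
        # ...then generate a point from that sub-model.
        x = rng.gauss(mus[i], sigmas[i])
        samples.append((x, i))
    return samples


if __name__ == "__main__":
    for x, i in sample_mixture([0.3, 0.7], [0.0, 5.0], [1.0, 0.5], n=5, seed=0):
        print(f"component {i}: x = {x:.3f}")
```

Dropping the second element of each pair gives exactly the situation described for hidden variables: the data without the identity of the process that produced it.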
Parameterizing a mixture How do you find the params, etc? Simple answer: use maximum likelihood: Write down joint likelihood function Differentiate Set equal to 0 Solve for params Unfortunately... It doesn’t work in this case Good exercise: try it and see why it breaks Answer: Expectation Maximization
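One way to see where the direct approach breaks, sketched for a 1-d Gaussian mixture and assuming data x_1, ..., x_N with mixing weights α_i, means μ_i, and variances σ_i² (notation assumed here):

```latex
\[
\log L(\Theta) \;=\; \sum_{j=1}^{N} \log \sum_{i=1}^{k}
  \alpha_i \, \mathcal{N}(x_j \mid \mu_i, \sigma_i^2)
\]
\[
\frac{\partial \log L}{\partial \mu_i}
 \;=\; \sum_{j=1}^{N}
   \frac{\alpha_i \, \mathcal{N}(x_j \mid \mu_i, \sigma_i^2)}
        {\sum_{l=1}^{k} \alpha_l \, \mathcal{N}(x_j \mid \mu_l, \sigma_l^2)}
   \cdot \frac{x_j - \mu_i}{\sigma_i^2}
 \;=\; 0
\]
```

The logarithm of a sum does not split into per-component terms, and the ratio inside the derivative depends on every parameter of every component. The resulting equations are coupled and have no closed-form solution.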
Expectation-Maximization General method for doing maximum likelihood in the presence of hidden variables. Identified by Dempster, Laird, & Rubin (1977). Called the “EM algorithm”, but is really more of a “meta-algorithm”: a recipe for writing algorithms. Works in general when you have: a probability distribution over some data set; missing feature/label values for some/all data points. Special cases: Gaussian mixtures; hidden Markov models; Kalman filters; POMDPs.
The Gaussian mixture case Assume: data generated from 1-d mixture of Gaussians: Whole data set: Introduce a “responsibility” variable: If you know model params, can calculate responsibilities
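A minimal sketch of the responsibility calculation, assuming the standard definition: the responsibility of component i for point x_j is the posterior probability that component i generated x_j, i.e. α_i N(x_j | μ_i, σ_i) divided by the same quantity summed over all components. The function names are hypothetical, and the code stores the result indexed as `z[j][i]` (point first, component second), which is a choice made here rather than something the slides specify.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a 1-d Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def responsibilities(xs, alphas, mus, sigmas):
    """Return z where z[j][i] is the responsibility of component i for xs[j]."""
    z = []
    for x in xs:
        # Unnormalized posterior: prior weight times likelihood.
        weighted = [a * normal_pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas)]
        total = sum(weighted)  # assumed nonzero; extreme outliers may underflow
        z.append([w / total for w in weighted])
    return z


if __name__ == "__main__":
    for row in responsibilities([0.0, 2.5, 5.0], [0.5, 0.5], [0.0, 5.0], [1.0, 1.0]):
        print([round(v, 4) for v in row])
```

Each row sums to 1, so the responsibilities act as soft cluster memberships for the corresponding point.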
Parameterizing responsibly Assume you know the responsibilities, z_ij. Can use these to find the parameters for each Gaussian (think about the special case where z_ij = 0 or 1):
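A sketch of the parameter update, assuming the standard responsibility-weighted estimates for a 1-d Gaussian mixture: each component's weight is its share of total responsibility, its mean is the responsibility-weighted average of the data, and its variance is the responsibility-weighted average squared deviation. The function name `m_step` is hypothetical, and `z[j][i]` is assumed to hold the responsibility of component i for point j.

```python
import math


def m_step(xs, z):
    """Re-estimate (alphas, mus, sigmas) from data xs and responsibilities z."""
    n = len(xs)
    k = len(z[0])
    alphas, mus, sigmas = [], [], []
    for i in range(k):
        # Effective number of points assigned to component i (assumed > 0).
        n_i = sum(z[j][i] for j in range(n))
        mu = sum(z[j][i] * xs[j] for j in range(n)) / n_i
        var = sum(z[j][i] * (xs[j] - mu) ** 2 for j in range(n)) / n_i
        alphas.append(n_i / n)
        mus.append(mu)
        sigmas.append(math.sqrt(var))
    return alphas, mus, sigmas


if __name__ == "__main__":
    xs = [0.0, 2.0, 10.0, 12.0]
    hard = [[1, 0], [1, 0], [0, 1], [0, 1]]  # special case: z is 0 or 1
    print(m_step(xs, hard))
```

In the special case where every responsibility is 0 or 1, these formulas reduce to the ordinary per-cluster fraction, sample mean, and (maximum likelihood) standard deviation. Alternating this update with recomputing the responsibilities from the new parameters gives the EM iteration for this model.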