Announcements
Spring Courses Somewhat Relevant to Machine Learning:
- 5314: Algorithms for molecular bio (who’s teaching?)
- 5446: Chaotic dynamics (Bradley)
- 5454: Algorithms (Frangillo)
- 5502: Data mining (Lv)
- 5753: Computer performance modeling (Grunwald)
- Geospatial data analysis (Caleb Phillips)
- Human-robot interaction (Dan Szafir)
- Data analytics: Systems, algorithms, and applications (Lv)
- Bioinformatics (Robin Dowell-Dean)
Homework: Importance sampling via likelihood weighting

Learning In Bayesian Networks: Missing Data And Hidden Variables

Missing Vs. Hidden Variables
Missing: often known but absent for certain data points
- missing at random, or missing based on value (e.g., Netflix ratings)
Hidden: never observed but essential for predicting visible variables
- e.g., human memory state
- a.k.a. latent variables

Quiz
“Semisupervised learning” concerns learning where additional input examples are available, but labels are not. According to the model below, will partial data (either X or Y alone) inform the model parameters?
[Figure: Bayes net X → Y with parameters θ_x, θ_y|x, θ_y|~x, and a table of cases with X known, Y known, or both]

[Figure: the same Bayes net, X → Y with parameters θ_x, θ_y|x, θ_y|~x]

Missing Data: Exact Inference In Bayes Net
Y: observed variables; Z: unobserved variables; X = {Y, Z}.
How do we do parameter updates for θ_i in this case?
- If X_i and Pa_i are observed, the situation is straightforward (e.g., like the single-coin-toss case).
- If X_i or any of Pa_i are missing, we need to marginalize over Z.
E.g., X_i ~ Categorical(θ_ij), where θ_ij is the parameter vector for X_i under parent configuration j (its components range over the values of X_i), with a Dirichlet prior on each θ_ij. Note: with missing data, the posterior is a Dirichlet mixture.
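To make the "Dirichlet mixture" remark concrete, here is a sketch under the usual assumptions (a Dirichlet prior on each θ_ij, and a single incomplete case with observed part y and missing part z):
\[
p(\theta_{ij} \mid y) \;=\; \sum_{z} p(z \mid y)\,
\mathrm{Dir}\!\left(\theta_{ij} \,\middle|\, \alpha_{ij1} + N_{ij1}(y,z),\; \ldots,\; \alpha_{ijr_i} + N_{ijr_i}(y,z)\right)
\]
where N_ijk(y,z) counts how often X_i = k with parent configuration j in the completed case and r_i is the number of values of X_i. Each term is a Dirichlet, so the posterior is a mixture of Dirichlets with one component per completion z.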

Missing Data: Gibbs Sampling
Given a set of observed incomplete data, D = {y_1, ..., y_N}:
1. Fill in arbitrary values for the unobserved variables in each case, giving a completed data set D_c.
2. For each unobserved variable z_i in case n, sample it from its conditional distribution given all the other (completed) variables.
3. Evaluate the posterior density on the completed data D_c'.
4. Repeat steps 2 and 3, and compute the mean of the posterior densities.
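A minimal, hypothetical sketch of this procedure for a toy net X → Y with some Y values missing (Beta(1,1) priors; the data and variable names below are made up, not from the slides):

```python
"""Gibbs sampling with missing data in a toy Bayes net X -> Y.
Hypothetical example: X, Y binary; Beta(1,1) priors on P(Y=1|X)."""
import numpy as np

rng = np.random.default_rng(0)

# X fully observed; Y has missing entries marked None
X = np.array([1, 1, 0, 1, 0, 0, 1, 1])
Y = [1, None, 0, 1, None, 0, 1, None]
missing = [i for i, y in enumerate(Y) if y is None]

# Step 1: fill in arbitrary values for the unobserved variables
Yc = np.array([y if y is not None else int(rng.integers(0, 2)) for y in Y])

samples = []
for sweep in range(2000):
    # Sample parameters from their Beta posteriors given the completed data
    th1 = rng.beta(1 + (Yc[X == 1] == 1).sum(), 1 + (Yc[X == 1] == 0).sum())
    th0 = rng.beta(1 + (Yc[X == 0] == 1).sum(), 1 + (Yc[X == 0] == 0).sum())
    # Step 2: resample each missing Y from its conditional given X and parameters
    for i in missing:
        p1 = th1 if X[i] == 1 else th0
        Yc[i] = int(rng.random() < p1)
    samples.append((th1, th0))

# Steps 3-4: summarize the posterior by averaging over sweeps (after burn-in)
print(np.mean(samples[500:], axis=0))
```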

Missing Data: Gaussian Approximation
Approximate p(θ | D) as a multivariate Gaussian. Appropriate if the sample size |D| is large, which is also the regime where Monte Carlo becomes inefficient.
1. Find the MAP configuration by maximizing g(θ) ≡ ln(p(D | θ) p(θ)).
2. Approximate g with a second-degree Taylor polynomial around the MAP configuration.
3. This leads to an approximate posterior that is Gaussian, with covariance given by the inverse of the negative Hessian of g evaluated at the MAP configuration.
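A sketch of the expansion behind steps 2–3, writing \(\tilde{\theta}\) for the MAP configuration (the first-order term vanishes because \(\tilde{\theta}\) is a maximum):
\[
g(\theta) \;\approx\; g(\tilde{\theta}) - \tfrac{1}{2}(\theta - \tilde{\theta})^{\top} A\,(\theta - \tilde{\theta}),
\qquad A = -\nabla\nabla g(\theta)\big|_{\theta = \tilde{\theta}}
\]
so that
\[
p(\theta \mid D) \;\approx\; \mathcal{N}\!\left(\theta \,\middle|\, \tilde{\theta},\, A^{-1}\right).
\]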

Missing Data: Further Approximations
As the data sample size increases:
- the Gaussian peak becomes sharper, so we can make predictions based on the MAP configuration alone
- the priors can be ignored (their importance diminishes), which leads to maximum likelihood estimation
How to do ML estimation:
- Expectation Maximization
- Gradient methods

Expectation Maximization
A scheme for picking values of missing data and hidden variables that maximize the data likelihood.
E.g., the population at the Laughing Goat, with each customer described by the items observed:
- baby stroller, diapers, lycra pants
- backpack, saggy pants
- baby stroller, diapers
- backpack, computer, saggy pants
- diapers, lycra
- computer, saggy pants
- backpack, saggy pants
(The hidden variable is which group each customer belongs to.)

Expectation Maximization
Formally:
- V: visible variables
- H: hidden variables
- θ: model parameters
Model: P(V, H | θ)
Goal: learn the model parameters θ in the absence of H
Approach: find the θ that maximizes P(V | θ)
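In symbols, the hidden variables are marginalized out of the joint model:
\[
\theta^{*} \;=\; \arg\max_{\theta}\; P(V \mid \theta)
\;=\; \arg\max_{\theta}\; \sum_{H} P(V, H \mid \theta).
\]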

EM Algorithm (Barber, Chapter 11)

EM Algorithm
Guaranteed to find a local optimum of θ.
Sketch of proof:
- There is a lower bound on the marginal likelihood, with equality only when q(h|v) = p(h|v, θ).
- E-step: for fixed θ, find the q(h|v) that maximizes the RHS.
- M-step: for fixed q, find the θ that maximizes the RHS.
- If each step maximizes the RHS, it also improves the LHS (more precisely, it never lowers the LHS).
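The bound referred to above, written out (a standard derivation; q is any distribution over the hidden variables):
\[
\log p(v \mid \theta)
\;=\; \mathrm{KL}\!\big(q(h \mid v)\,\|\,p(h \mid v,\theta)\big)
\;+\; \sum_{h} q(h \mid v)\,\log\frac{p(v,h \mid \theta)}{q(h \mid v)}
\;\ge\; \sum_{h} q(h \mid v)\,\log\frac{p(v,h \mid \theta)}{q(h \mid v)},
\]
with equality exactly when q(h|v) = p(h|v, θ), since the KL divergence is nonnegative.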

Barber Example
Contours are of the lower bound.
Note the alternating steps along the θ and q axes; these are not gradient steps and can be large.
The choice of initial θ determines which local likelihood optimum is found.

Clustering: K-Means Vs. EM
K-means:
1. Choose some initial values of μ_k.
2. Assign each data point to the closest cluster.
3. Recalculate each μ_k to be the mean of the set of points assigned to cluster k.
4. Iterate to step 2.
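A minimal sketch of these four steps on made-up 2-D data (the data and K below are hypothetical, not from the slides):

```python
"""Minimal K-means sketch on synthetic 2-D data."""
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two blobs
K = 2

# 1. Choose some initial values of mu_k (here: random data points)
mu = X[rng.choice(len(X), K, replace=False)]

for _ in range(100):
    # 2. Assign each data point to the closest cluster
    dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
    z = dists.argmin(axis=1)
    # 3. Recalculate mu_k as the mean of the points assigned to cluster k
    #    (assumes no cluster ends up empty)
    new_mu = np.array([X[z == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_mu, mu):   # converged
        break
    mu = new_mu                   # 4. iterate

print(mu)
```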

K-means Clustering From C. Bishop, Pattern Recognition and Machine Learning

K-means Clustering

Clustering: K-Means Vs. EM
K-means:
1. Choose some initial values of μ_k.
2. Assign each data point to the closest cluster.
3. Recalculate each μ_k to be the mean of the set of points assigned to cluster k.
4. Iterate to step 2.

Clustering: K-Means Vs. EM
EM:
1. Choose some initial values of μ_k.
2. Probabilistically assign each data point to clusters: compute P(Z = k | x, μ).
3. Recalculate each μ_k as the weighted mean of all points, weighting each point by P(Z = k | x, μ).
4. Iterate to step 2.

EM for Gaussian Mixtures
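The slide's equations are not reproduced in this transcript; as a stand-in, here is a simplified EM sketch for a mixture of two 1-D Gaussians with fixed unit variance (a full treatment would also update the covariances). The data are synthetic:

```python
"""EM for a mixture of two 1-D Gaussians with fixed unit variance.
Simplified, hypothetical sketch; data are synthetic."""
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
K = 2
mu = rng.normal(0, 1, K)       # initial means
pi = np.full(K, 1.0 / K)       # mixing proportions

for _ in range(50):
    # E-step: responsibilities r[n, k] proportional to pi_k * N(x_n | mu_k, 1)
    lik = pi[None, :] * np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)
    r = lik / lik.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted updates of the means and proportions
    Nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    pi = Nk / len(x)

print(mu, pi)   # means converge near -2 and 3
```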

Variational Bayes
A generalization of EM that also deals with missing data and hidden variables.
Produces a posterior on the parameters, not just an ML solution.
Basic (zeroth-order) idea: do EM to obtain estimates of p(θ) rather than θ directly.

Variational Bayes
Assume a factorized approximation of the joint posterior over hidden variables and parameters, and find the marginals that make this approximation as close as possible to the true posterior.
Advantage? Bayesian Occam’s razor: a vaguely specified parameter corresponds to a simpler model, which reduces overfitting.
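The slide's formula is not in the transcript; presumably the factorization intended is the standard mean-field one:
\[
p(H, \theta \mid V) \;\approx\; q(H, \theta) \;=\; q_H(H)\, q_\theta(\theta),
\]
with q_H and q_θ chosen to minimize KL(q(H, θ) ‖ p(H, θ | V)), i.e., to make the approximation as close as possible to the true joint posterior.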

Gradient Methods
Useful for continuous parameters θ.
Make small incremental steps to maximize the likelihood, using the gradient update
\[
\theta \;\leftarrow\; \theta + \eta\, \nabla_{\theta} \ln p(D \mid \theta).
\]
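A minimal, hypothetical example of such incremental steps, fitting a single Bernoulli parameter by gradient ascent on the log likelihood (a single coin, not a full Bayes net):

```python
"""Gradient ascent on the log likelihood of a Bernoulli parameter.
Hypothetical minimal example; theta is kept in (0, 1) by a simple clip."""
import numpy as np

data = np.array([1, 0, 1, 1, 1, 0, 1, 0, 1, 1])   # coin flips
theta, eta = 0.5, 0.01

for _ in range(1000):
    # d/d theta of sum_n [x_n ln theta + (1 - x_n) ln(1 - theta)]
    grad = (data / theta - (1 - data) / (1 - theta)).sum()
    theta = np.clip(theta + eta * grad, 1e-6, 1 - 1e-6)

print(theta, data.mean())   # gradient ascent converges to the ML estimate
```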

All Learning Methods Apply To Arbitrary Local Distribution Functions
The local distribution function performs either:
- probabilistic classification (discrete RVs)
- probabilistic regression (continuous RVs)
Complete flexibility in specifying the local distribution function:
- analytical function (e.g., homework 5)
- look-up table
- logistic regression
- neural net
- etc.
(Figure label: LOCAL DISTRIBUTION FUNCTION)

Summary Of Learning Section
- Given model structure and probabilities, inferring latent variables
- Given model structure, learning model probabilities
  - complete data
  - missing data
- Learning model structure

Learning Model Structure

Learning Structure and Parameters
The principle:
- Treat the network structure, S^h, as a discrete RV.
- Calculate the structure posterior.
- Integrate over uncertainty in structure to predict.
The practice:
- Computing the marginal likelihood, p(D | S^h), can be difficult.
- Learning structure can be impractical due to the large number of hypotheses (more than exponential in the number of nodes).
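Written out, the principle above amounts to (a sketch following the standard Bayesian treatment, e.g., Heckerman's tutorial):
\[
p(S^h \mid D) \;\propto\; p(S^h)\, p(D \mid S^h),
\qquad
p(x_{N+1} \mid D) \;=\; \sum_{S^h} p(S^h \mid D) \int p(x_{N+1} \mid \theta, S^h)\, p(\theta \mid D, S^h)\, d\theta.
\]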

source:

Approach to Structure Learning
- Model selection: find a good model and treat it as the correct model.
- Selective model averaging: select a manageable number of candidate models and pretend that these models are exhaustive.
Experimentally, both of these approaches produce good results, i.e., good generalization.

SLIDES STOLEN FROM DAVID HECKERMAN

Interpretation of Marginal Likelihood
Using the chain rule for probabilities (see below), maximizing the marginal likelihood also maximizes sequential prediction ability!
Relation to leave-one-out cross validation.
Problems with cross validation:
- can overfit the data, possibly because of interchanges (each item is used for training and for testing each other item)
- has a hard time dealing with temporal sequence data
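The chain-rule decomposition behind this claim (the slide's own formula is not captured in the transcript):
\[
p(D \mid S^h) \;=\; \prod_{n=1}^{N} p(\mathbf{x}_n \mid \mathbf{x}_1, \ldots, \mathbf{x}_{n-1}, S^h),
\]
so the structure that maximizes the marginal likelihood is also the one that best predicts each case from the cases seen before it.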

Coin Example

(Formula annotation: α_h, α_t, #h, and #t are all indexed by these conditions.)

# parent configurations, # nodes, # node states
(Formula annotation: the products range over nodes, parent configurations, and node states.)

Computation of Marginal Likelihood
An efficient closed-form solution exists if:
- there are no missing data (including no hidden variables)
- the parameters θ are mutually independent
- the local distribution functions are from the exponential family (binomial, Poisson, gamma, Gaussian, etc.)
- conjugate priors are used
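For discrete networks with Dirichlet priors, these conditions give the familiar Bayesian-Dirichlet closed form, which is likely the formula the annotated slides above refer to:
\[
p(D \mid S^h) \;=\; \prod_{i=1}^{n} \prod_{j=1}^{q_i}
\frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})}
\prod_{k=1}^{r_i}
\frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})},
\]
where i ranges over nodes, j over parent configurations, k over node states, N_ijk are the data counts, and α_ij = Σ_k α_ijk, N_ij = Σ_k N_ijk.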

Computation of Marginal Likelihood
Approximation techniques must be used otherwise; e.g., for missing data we can use the Gibbs sampling or the Gaussian approximation described earlier.
Rearranging Bayes' theorem gives p(D | S^h) = p(θ | S^h) p(D | θ, S^h) / p(θ | D, S^h), evaluated at any convenient θ:
1. Evaluate the numerator directly; estimate the denominator using Gibbs sampling.
2. For large amounts of data, the numerator can be approximated by a multivariate Gaussian.

Structure Priors
- Hypothesis equivalence: identify the equivalence class of a given network structure.
- All possible structures equally likely.
- Partial specification: required and prohibited arcs (based on causal knowledge).
- Ordering of variables plus independence assumptions:
  - ordering based on, e.g., temporal precedence
  - presence or absence of arcs mutually independent → n(n−1)/2 priors
- p(m) ~ similarity(m, prior belief net)

Parameter Priors
- All uniform: Beta(1,1)
- Use a prior belief net
- Parameters depend only on local structure

Model Search
Finding the belief-net structure with the highest score among those structures with at most k parents is NP-hard for k > 1 (Chickering, 1995).
Sequential search:
- add, remove, or reverse arcs
- ensure no directed cycles
- efficient in that changes to arcs affect only some components of p(D | M)
Heuristic methods:
- greedy
- greedy with restarts
- MCMC / simulated annealing
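A hypothetical sketch of greedy sequential search over arcs; the score function below is a stub standing in for a real decomposable score such as log p(D | M):

```python
"""Greedy hill-climbing over Bayes-net structures (hypothetical sketch)."""
import itertools

def has_cycle(nodes, edges):
    """Detect a directed cycle with depth-first search."""
    color = {v: 0 for v in nodes}          # 0 = unvisited, 1 = on stack, 2 = done
    def visit(v):
        color[v] = 1
        for a, b in edges:
            if a == v and (color[b] == 1 or (color[b] == 0 and visit(b))):
                return True
        color[v] = 2
        return False
    return any(color[v] == 0 and visit(v) for v in nodes)

def score(edges, data):
    """Stub score: negative number of arc differences from a fixed target.
    Replace with a real decomposable score, e.g., the BD marginal likelihood."""
    target = {("X", "Y"), ("Y", "Z")}
    return -len(set(edges) ^ target)

def neighbors(nodes, edges):
    """All structures reachable by adding, removing, or reversing one arc."""
    for a, b in itertools.permutations(nodes, 2):
        if (a, b) in edges:
            yield edges - {(a, b)}                  # remove arc
            yield (edges - {(a, b)}) | {(b, a)}     # reverse arc
        else:
            yield edges | {(a, b)}                  # add arc

def greedy_search(nodes, data, start=frozenset()):
    current, best = start, score(start, data)
    while True:
        # steepest-ascent step over acyclic neighbours only
        cands = [c for c in neighbors(nodes, set(current)) if not has_cycle(nodes, c)]
        cand = max(cands, key=lambda c: score(c, data))
        if score(cand, data) <= best:
            return current, best                    # local optimum reached
        current, best = frozenset(cand), score(cand, data)

print(greedy_search(["X", "Y", "Z"], data=None))
```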

two most likely structures

2 × 10^10