Lecture 2: Statistical learning primer for biologists

Lecture 2: Statistical learning primer for biologists. Alan Qi, Purdue Statistics and CS, Jan. 15, 2009

Outline Basics of probability; regression; graphical models: Bayesian networks and Markov random fields; unsupervised learning: K-means and expectation maximization

Probability Theory Sum Rule Product Rule

The Rules of Probability Sum Rule Product Rule
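In standard notation, with X and Y discrete random variables, the two rules are:
\[
p(X) = \sum_{Y} p(X, Y), \qquad p(X, Y) = p(Y \mid X)\, p(X).
\]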

Bayes’ Theorem posterior ∝ likelihood × prior
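Written out, with the denominator given by the sum and product rules:
\[
p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}, \qquad p(X) = \sum_{Y} p(X \mid Y)\, p(Y).
\]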

Probability Density & Cumulative Distribution Functions
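For a continuous variable x with density p(x) (where p(x) \ge 0 and \int p(x)\,dx = 1), the probability of an interval and the cumulative distribution function are:
\[
p\big(x \in (a, b)\big) = \int_a^b p(x)\, dx, \qquad P(z) = \int_{-\infty}^{z} p(x)\, dx.
\]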

Expectations Conditional Expectation (discrete) Approximate Expectation (discrete and continuous)
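In the discrete case, together with the sample-based approximation used when only draws x_n from p(x) are available:
\[
\mathbb{E}[f] = \sum_{x} p(x)\, f(x), \qquad \mathbb{E}_x[f \mid y] = \sum_{x} p(x \mid y)\, f(x), \qquad \mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n).
\]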

Variances and Covariances
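The standard definitions:
\[
\mathrm{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2, \qquad
\mathrm{cov}[x, y] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y].
\]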

The Gaussian Distribution
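The univariate Gaussian density, with mean \mu and variance \sigma^2:
\[
\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}.
\]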

Gaussian Mean and Variance
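Its first two moments:
\[
\mathbb{E}[x] = \mu, \qquad \mathbb{E}[x^2] = \mu^2 + \sigma^2, \qquad \mathrm{var}[x] = \sigma^2.
\]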

The Multivariate Gaussian
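For a D-dimensional vector x with mean vector \boldsymbol\mu and covariance matrix \boldsymbol\Sigma:
\[
\mathcal{N}(\mathbf{x} \mid \boldsymbol\mu, \boldsymbol\Sigma) = \frac{1}{(2\pi)^{D/2}\, |\boldsymbol\Sigma|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol\mu)^{\mathrm T} \boldsymbol\Sigma^{-1} (\mathbf{x} - \boldsymbol\mu) \right\}.
\]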

Gaussian Parameter Estimation Likelihood function
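For N i.i.d. observations x_1, ..., x_N drawn from a single Gaussian:
\[
p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2).
\]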

Maximum (Log) Likelihood
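Maximizing the log likelihood
\[
\ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)
\]
gives the familiar estimates
\[
\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2.
\]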

Properties of μ_ML and σ²_ML The ML estimate of the mean is unbiased; the ML estimate of the variance is biased.
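Specifically:
\[
\mathbb{E}[\mu_{\mathrm{ML}}] = \mu, \qquad \mathbb{E}[\sigma^2_{\mathrm{ML}}] = \frac{N - 1}{N}\, \sigma^2,
\]
so multiplying \sigma^2_{\mathrm{ML}} by N/(N-1) gives an unbiased estimate of the variance.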

Curve Fitting Re-visited

Maximum Likelihood Determine w_ML by minimizing the sum-of-squares error E(w).
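Assuming the usual setup of a model y(x, w) fitted to targets t_n under Gaussian noise (the lecture's exact notation may differ), maximizing the likelihood with respect to w is equivalent to minimizing
\[
E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big\{ y(x_n, \mathbf{w}) - t_n \big\}^2.
\]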

Predictive Distribution

MAP: A Step towards Bayes Determine w_MAP by minimizing the regularized sum-of-squares error.
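One common form of the regularized error, with regularization coefficient \lambda arising from a Gaussian prior on w (the lecture may parameterize this differently, e.g., via prior and noise precisions):
\[
\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big\{ y(x_n, \mathbf{w}) - t_n \big\}^2 + \frac{\lambda}{2}\, \|\mathbf{w}\|^2.
\]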

Bayesian Curve Fitting

Bayesian Networks Directed Acyclic Graph (DAG)

Bayesian Networks General Factorization
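With pa_k denoting the parents of node x_k in the DAG, the joint distribution factorizes as
\[
p(\mathbf{x}) = \prod_{k=1}^{K} p(x_k \mid \mathrm{pa}_k).
\]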

Generative Models Causal process for generating images

Discrete Variables (1) General joint distribution: K^2 − 1 parameters. Independent joint distribution: 2(K − 1) parameters.

Discrete Variables (2) General joint distribution over M variables: K^M − 1 parameters. M-node Markov chain: K − 1 + (M − 1)K(K − 1) parameters.

Discrete Variables: Bayesian Parameters (1)

Discrete Variables: Bayesian Parameters (2) Shared prior

Parameterized Conditional Distributions If x_1, ..., x_M are discrete, K-state variables, the general conditional distribution p(y | x_1, ..., x_M) has O(K^M) parameters. A parameterized form requires only M + 1 parameters.
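One common parameterized choice (assumed here for illustration) is a logistic sigmoid applied to a linear combination of the parents, which uses exactly M + 1 weights:
\[
p(y = 1 \mid x_1, \dots, x_M) = \sigma\!\Big( w_0 + \sum_{i=1}^{M} w_i x_i \Big), \qquad \sigma(a) = \frac{1}{1 + e^{-a}}.
\]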

Conditional Independence a is independent of b given c.
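Equivalently, in the standard notation:
\[
p(a \mid b, c) = p(a \mid c) \quad\Longleftrightarrow\quad p(a, b \mid c) = p(a \mid c)\, p(b \mid c), \qquad \text{written } a \perp\!\!\!\perp b \mid c.
\]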

Conditional Independence: Example 1

Conditional Independence: Example 1

Conditional Independence: Example 2

Conditional Independence: Example 2

Conditional Independence: Example 3 Note: this is the opposite of Example 1, with c unobserved.

Conditional Independence: Example 3 Note: this is the opposite of Example 1, with c observed.

“Am I out of fuel?” B = Battery (0=flat, 1=fully charged), F = Fuel Tank (0=empty, 1=full), G = Fuel Gauge Reading (0=empty, 1=full)

“Am I out of fuel?” Probability of an empty tank increased by observing G = 0.

“Am I out of fuel?” Probability of an empty tank is reduced by observing B = 0. This is referred to as “explaining away”.

The Markov Blanket Factors independent of x_i cancel between numerator and denominator.

Markov Random Fields Markov Blanket

Cliques and Maximal Cliques

Joint Distribution ψ_C(x_C) is the potential over clique C and Z is the normalization coefficient; note: with M K-state variables there are K^M terms in Z. Energies and the Boltzmann distribution
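In formulas, with C ranging over the (maximal) cliques of the graph:
\[
p(\mathbf{x}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C), \qquad Z = \sum_{\mathbf{x}} \prod_{C} \psi_C(\mathbf{x}_C), \qquad \psi_C(\mathbf{x}_C) = \exp\{ -E(\mathbf{x}_C) \}.
\]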

Illustration: Image De-Noising (1) Original Image Noisy Image

Illustration: Image De-Noising (2)

Illustration: Image De-Noising (3) Noisy Image Restored Image (ICM)

Converting Directed to Undirected Graphs (1)

Converting Directed to Undirected Graphs (2) Additional links: “marrying parents”, i.e., moralization

Directed vs. Undirected Graphs (2)

Inference on a Chain Computational time increases exponentially with N.

Inference on a Chain

Supervised Learning Supervised learning: learning with examples or labels, e.g., classification and regression. Examples: linear regression (the example we just gave), generalized linear models (e.g., probit classification), support vector machines, Gaussian process classification, etc. Take CS590M Machine Learning in fall 2009.

Unsupervised Learning Supervised learning: learning with examples or labels, e.g., classification and regression Unsupervised learning: learning without examples or labels, e.g., clustering, mixture models, PCA, non-negative matrix factorization

K-means Clustering: Goal

Cost Function
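The distortion measure minimized by K-means, with binary indicators r_nk marking the assignment of point x_n to the cluster with center μ_k:
\[
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \| \mathbf{x}_n - \boldsymbol\mu_k \|^2.
\]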

Two Stage Updates

Optimizing Cluster Assignment

Optimizing Cluster Centers

Convergence of Iterative Updates
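As a concrete illustration of the two-stage updates and their convergence, here is a minimal NumPy sketch of K-means; function and variable names are my own, not from the lecture.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means on an (N, D) data matrix X with K clusters."""
    rng = np.random.default_rng(seed)
    # Initialize centers with K randomly chosen data points.
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    assign = np.zeros(len(X), dtype=int)
    for it in range(n_iters):
        # Assignment step: each point goes to its nearest center (minimizes J over r_nk).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_assign, assign):
            break  # assignments unchanged, so J can no longer decrease
        assign = new_assign
        # Update step: each center moves to the mean of its points (minimizes J over mu_k).
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return centers, assign

# Example usage on synthetic 2-D data with three well-separated groups:
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(100, 2)) for loc in ([0, 0], [3, 3], [0, 3])])
centers, assign = kmeans(X, K=3)
```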

Example of K-Means Clustering

Mixture of Gaussians Introduce latent variables z indicating the mixture component; the marginal distribution of x is then a weighted sum of Gaussians.
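With a one-of-K latent vector z and mixing coefficients π_k:
\[
p(z_k = 1) = \pi_k, \qquad 0 \le \pi_k \le 1, \quad \sum_{k=1}^{K} \pi_k = 1, \qquad
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol\mu_k, \boldsymbol\Sigma_k).
\]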

Conditional Probability Responsibility that component k takes for explaining the observation.
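The responsibility is the posterior probability of component k given the observation:
\[
\gamma(z_k) \equiv p(z_k = 1 \mid \mathbf{x}) = \frac{\pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol\mu_k, \boldsymbol\Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x} \mid \boldsymbol\mu_j, \boldsymbol\Sigma_j)}.
\]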

Maximum Likelihood Maximize the log likelihood function
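For a data set X = {x_1, ..., x_N}:
\[
\ln p(\mathbf{X} \mid \boldsymbol\pi, \boldsymbol\mu, \boldsymbol\Sigma) = \sum_{n=1}^{N} \ln \Big\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol\mu_k, \boldsymbol\Sigma_k) \Big\}.
\]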

Maximum Likelihood Conditions (1) Setting the derivatives of the log likelihood with respect to the means μ_k to zero:
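In standard notation, with responsibilities γ(z_nk) and effective counts N_k, this gives:
\[
\boldsymbol\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, \mathbf{x}_n, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk}).
\]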

Maximum Likelihood Conditions (2) Setting the derivative of the log likelihood with respect to the covariances Σ_k to zero:
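This yields:
\[
\boldsymbol\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (\mathbf{x}_n - \boldsymbol\mu_k)(\mathbf{x}_n - \boldsymbol\mu_k)^{\mathrm T}.
\]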

Maximum Likelihood Conditions (3) Lagrange function for the mixing coefficients π_k: setting its derivative to zero and using the normalization constraint, we obtain:
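Maximizing the Lagrange function
\[
\ln p(\mathbf{X} \mid \boldsymbol\pi, \boldsymbol\mu, \boldsymbol\Sigma) + \lambda \Big( \sum_{k=1}^{K} \pi_k - 1 \Big)
\]
with respect to π_k and using the constraint gives
\[
\pi_k = \frac{N_k}{N}.
\]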

Expectation Maximization for Mixtures of Gaussians Although the previous conditions do not give a closed-form solution (the responsibilities themselves depend on the parameters), we can use them to construct iterative updates: E step: compute responsibilities γ(z_nk). M step: compute new means μ_k, covariances Σ_k, and mixing coefficients π_k. Loop over the E and M steps until the log likelihood stops increasing.
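To make the E and M steps concrete, here is a minimal NumPy/SciPy sketch of EM for a Gaussian mixture (names are illustrative, not from the lecture); it uses scipy.stats.multivariate_normal for the component densities and a small ridge term to keep covariances well conditioned.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    """Minimal EM for a K-component Gaussian mixture on an (N, D) data matrix X."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # Initialization: random data points as means, identity covariances, uniform weights.
    means = X[rng.choice(N, size=K, replace=False)].astype(float)
    covs = np.array([np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E step: responsibilities gamma[n, k] proportional to pi_k * N(x_n | mu_k, Sigma_k).
        dens = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
            for k in range(K)
        ])
        ll = np.log(dens.sum(axis=1)).sum()              # current log likelihood
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate means, covariances and mixing coefficients.
        Nk = gamma.sum(axis=0)
        means = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pis = Nk / N
        if ll - prev_ll < tol:                           # stop when the log likelihood stops increasing
            break
        prev_ll = ll
    return pis, means, covs, gamma
```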

Example EM on the Old Faithful data set.

General EM Algorithm
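In general form, with observed data X, latent variables Z, and parameters θ: the E step evaluates p(Z | X, θ_old), and the M step maximizes the expected complete-data log likelihood:
\[
\boldsymbol\theta^{\mathrm{new}} = \arg\max_{\boldsymbol\theta}\; \mathcal{Q}(\boldsymbol\theta, \boldsymbol\theta^{\mathrm{old}}), \qquad
\mathcal{Q}(\boldsymbol\theta, \boldsymbol\theta^{\mathrm{old}}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta^{\mathrm{old}})\, \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol\theta).
\]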

EM as a Lower-Bounding Method Goal: maximize the log likelihood ln p(X | θ) in the presence of latent variables Z. Define a distribution q(Z) over the latent variables.
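We then have the decomposition (in standard notation):
\[
\ln p(\mathbf{X} \mid \boldsymbol\theta) = \mathcal{L}(q, \boldsymbol\theta) + \mathrm{KL}(q \,\|\, p),
\]
where
\[
\mathcal{L}(q, \boldsymbol\theta) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol\theta)}{q(\mathbf{Z})}, \qquad
\mathrm{KL}(q \,\|\, p) = -\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta)}{q(\mathbf{Z})}.
\]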

Lower Bound L(q, θ) is a functional of the distribution q(Z). Since ln p(X | θ) = L(q, θ) + KL(q ‖ p) and KL(q ‖ p) ≥ 0, L(q, θ) is a lower bound on the log likelihood function ln p(X | θ).

Illustration of Lower Bound

Lower Bound Perspective of EM Expectation Step: maximizing the functional lower bound L(q, θ) over the distribution q(Z). Maximization Step: maximizing the lower bound over the parameters θ.

Illustration of EM Updates