Usman Roshan CS 675 Machine Learning


Bayesian learning
Usman Roshan, CS 675 Machine Learning

Supervised learning for two classes
We are given n training samples (xi, yi) for i = 1..n drawn i.i.d. from a probability distribution P(x,y). Each xi is a d-dimensional vector (xi is in Rd) and yi is +1 or -1. Our problem is to learn a function f(x) for predicting the labels of test samples xi' (also in Rd) for i = 1..n', likewise drawn i.i.d. from P(x,y).

Classification: Bayesian learning
Bayes rule: P(M|x) = P(x|M) P(M) / P(x). To classify a given datapoint x we select the model (class) Mi with the highest P(Mi|x). The denominator P(x) is a normalizing term and does not affect the classification of a given datapoint, so it can be ignored. P(x|M) is called the likelihood and P(M) is the prior probability; to classify a given datapoint x we need to know both. If the priors P(M) are uniform (the same for all classes) then finding the model that maximizes the posterior P(M|D) is the same as finding the M that maximizes the likelihood P(D|M).
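As an illustration, a minimal Python sketch of this rule with made-up priors and likelihoods for two hypothetical classes C1 and C2 (all numbers are assumptions, not fitted values):

# Made-up priors P(M) and likelihoods P(x|M) for one fixed datapoint x
priors = {"C1": 0.5, "C2": 0.5}
likelihoods = {"C1": 0.3, "C2": 0.1}

# Unnormalized posteriors: P(M|x) is proportional to P(x|M) P(M); the evidence P(x) cancels
scores = {c: likelihoods[c] * priors[c] for c in priors}
prediction = max(scores, key=scores.get)
print(prediction)   # C1, the class with the largest P(x|M) P(M)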

Maximum likelihood
We can classify by simply selecting the model M that has the highest P(M|D), where D = data and M = model. Thus classification can also be framed as the problem of finding the M that maximizes P(M|D). By Bayes rule, P(M|D) = P(D|M) P(M) / P(D).

Maximum likelihood
Suppose we have k models to consider and each has the same prior probability, in other words a uniform prior distribution P(M) = 1/k. Then P(M|D) = P(D|M) (1/k) / P(D), and since 1/k and P(D) are the same for every model, we can solve the classification problem by finding the model that maximizes P(D|M). This is called the maximum likelihood optimization criterion.

Maximum likelihood
Suppose we have n i.i.d. samples (xi, yi) drawn from M. The likelihood P(D|M) is the product of the individual sample probabilities, P(D|M) = P(x1,y1|M) P(x2,y2|M) ... P(xn,yn|M). Consequently the log likelihood is the sum log P(D|M) = Σi log P(xi,yi|M).
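As a small illustration with made-up per-sample probabilities: the product of many probabilities quickly underflows, while the sum of logs stays well behaved, which is why the log likelihood is used in practice.

import math

probs = [0.8, 0.6, 0.9, 0.7, 0.5]   # assumed values of P(xi, yi | M)

likelihood = math.prod(probs)                        # P(D|M): product over samples
log_likelihood = sum(math.log(p) for p in probs)     # log P(D|M): sum of logs

print(likelihood, log_likelihood)   # log(likelihood) equals log_likelihood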

Maximum likelihood and empirical risk
Maximizing the likelihood P(D|M) is the same as maximizing log(P(D|M)), which is the same as minimizing -log(P(D|M)). Set the loss function to the negative log likelihood of a single sample, loss(xi, yi) = -log P(xi, yi|M). Now minimizing the empirical risk (the average loss over the training data) is the same as maximizing the likelihood (we return to this later).
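Continuing the same made-up numbers, a quick sketch of the correspondence: the empirical risk under the negative log likelihood loss is exactly -1/n times the log likelihood, so minimizing one maximizes the other.

import math

probs = [0.8, 0.6, 0.9, 0.7, 0.5]   # assumed values of P(xi, yi | M)
n = len(probs)

log_likelihood = sum(math.log(p) for p in probs)
empirical_risk = sum(-math.log(p) for p in probs) / n   # average loss over the data

print(empirical_risk, -log_likelihood / n)   # identical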

Maximum likelihood example
Consider a set of coin tosses produced by a coin with P(H) = p (and P(T) = 1-p). We want to estimate the probability P(H) of a coin that produces k heads and n-k tails. We are given some tosses (training data): HTHHHTHHHTHTH. Solution: form the log likelihood, differentiate with respect to p, set the derivative to 0, and solve for p.

Maximum likelihood example
The likelihood of k heads and n-k tails is p^k (1-p)^(n-k), so the log likelihood is k log(p) + (n-k) log(1-p). Setting its derivative k/p - (n-k)/(1-p) to 0 and solving gives p = k/n, the fraction of heads in the training data.
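A quick numerical check of this result on the training sequence above (the grid search over p is only an independent sanity check, not part of the derivation):

import math

tosses = "HTHHHTHHHTHTH"
k = tosses.count("H")      # number of heads
n = len(tosses)            # number of tosses

p_closed_form = k / n      # the maximum likelihood estimate p = k/n

# Brute-force check: evaluate the log likelihood on a grid of candidate p values
def log_lik(p):
    return k * math.log(p) + (n - k) * math.log(1 - p)

p_numeric = max((i / 1000 for i in range(1, 1000)), key=log_lik)

print(p_closed_form, p_numeric)   # both close to 9/13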

Classification by likelihood
Suppose we have two classes C1 and C2. Compute the likelihoods P(D|C1) and P(D|C2). To classify test data D' assign it to class C1 if P(D'|C1) is greater than P(D'|C2), and to C2 otherwise.

Gaussian models
Assume that the class likelihood is represented by a Gaussian distribution with parameters μ (mean) and σ (standard deviation), P(x|C1) = (1/(√(2π) σ1)) exp(-(x-μ1)²/(2σ1²)). We find the model (in other words the mean and variance) that maximizes the likelihood, or equivalently the log likelihood. Suppose we are given training points x1, x2, …, xn1 from class C1. Assuming that each datapoint is drawn independently from C1, the sample log likelihood is the sum of the log densities of the individual points.

Gaussian models
The log likelihood is given by log P(D|C1) = Σi [ -log(√(2π) σ1) - (xi-μ1)²/(2σ1²) ]. Setting the first derivatives with respect to μ1 and σ1 to 0 gives the maximum likelihood estimates of μ1 and σ1 (denoted m1 and s1 respectively): m1 = (1/n1) Σi xi and s1² = (1/n1) Σi (xi - m1)². Similarly we determine m2 and s2 for class C2.
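As a sketch with assumed one-dimensional training points, the estimates in code (note the division by n rather than n-1: the maximum likelihood variance estimate is the biased one):

import math

x = [2.1, 1.9, 2.4, 2.0, 1.6]               # assumed training points from one class
n = len(x)

m = sum(x) / n                               # ML estimate of the mean
var = sum((xi - m) ** 2 for xi in x) / n     # ML (biased) estimate of the variance
s = math.sqrt(var)                           # estimated standard deviation

print(m, s)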

Gaussian models
After having determined the class parameters for C1 and C2 we can classify a given datapoint by evaluating P(x|C1) and P(x|C2) and assigning it to the class with the higher likelihood (or log likelihood). The negative log likelihood can also be used as a loss function, which gives an equivalent representation in empirical risk minimization (we return to this later).

Gaussian classification example
Consider one-dimensional data for two classes (SNP genotypes for case and control subjects).
Case (class C1): 1, 1, 2, 1, 0, 2
Control (class C2): 0, 1, 0, 0, 1, 1
Under the Gaussian assumption the case and control classes are represented by Gaussian distributions with parameters (μ1, σ1) and (μ2, σ2) respectively. The maximum likelihood estimates of the means are m1 = (1+1+2+1+0+2)/6 ≈ 1.17 and m2 = (0+1+0+0+1+1)/6 = 0.5.

Gaussian classification example
The maximum likelihood estimates of the class variances are s1² = (1/6) Σi (xi - m1)² ≈ 0.47 for the case class; similarly s2² = 0.25 for the control class. Which class does x = 1 belong to? What about x = 0 and x = 2? What happens if the class variances are equal?
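A sketch that carries this example through in code: fit the two Gaussians by maximum likelihood and assign each query point to the class with the higher log likelihood.

import math

case = [1, 1, 2, 1, 0, 2]       # class C1
control = [0, 1, 0, 0, 1, 1]    # class C2

def ml_fit(xs):
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / len(xs)    # biased ML variance
    return m, var

def log_gauss(x, m, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)

m1, v1 = ml_fit(case)
m2, v2 = ml_fit(control)

for x in (0, 1, 2):
    label = "case" if log_gauss(x, m1, v1) > log_gauss(x, m2, v2) else "control"
    print(x, label)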

Multivariate Gaussian classification
Suppose each datapoint is a d-dimensional vector; in the previous example we would have d SNP genotypes instead of one. The class likelihood is given by the multivariate Gaussian density P(x|C1) = (1/((2π)^(d/2) |Σ1|^(1/2))) exp(-(1/2)(x-μ1)^T Σ1^(-1) (x-μ1)), where Σ1 is the class covariance matrix. Σ1 has dimension d x d, and its (i,j)th entry is the covariance of the ith and jth variables.

Multivariate Gaussian classification
The maximum likelihood estimates of μ1 and Σ1 are m1 = (1/n1) Σi xi and S1 = (1/n1) Σi (xi - m1)(xi - m1)^T. The class log likelihoods with the estimated parameters (ignoring constant terms) are log P(x|C1) = -(1/2) log|S1| - (1/2)(x - m1)^T S1^(-1) (x - m1), and analogously for class C2 with m2 and S2.
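A sketch of these formulas with an assumed small set of training points, using numpy for the matrix algebra:

import numpy as np

# Assumed training points from one class, one point per row
X = np.array([[1.0, 2.0, 0.0],
              [2.0, 2.0, 0.0],
              [2.0, 1.0, 1.0],
              [0.0, 2.0, 1.0]])

n1 = X.shape[0]
m1 = X.mean(axis=0)                 # ML estimate of the mean vector
centered = X - m1
S1 = centered.T @ centered / n1     # ML estimate of the d x d covariance matrix

def class_log_lik(x, m, S):
    # log P(x|C) with the constant -(d/2) log(2*pi) dropped
    diff = x - m
    return -0.5 * np.log(np.linalg.det(S)) - 0.5 * diff @ np.linalg.solve(S, diff)

print(class_log_lik(np.array([1.0, 2.0, 1.0]), m1, S1))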

Multivariate Gaussian classification
If S1 = S2 = S then the class log likelihoods with the estimated parameters (ignoring constant terms) reduce to log P(x|Ci) = -(1/2)(x - mi)^T S^(-1) (x - mi), so the classification depends only on the (Mahalanobis) distance of x to the class means.

Naïve Bayes algorithm
If we assume that the variables are independent (no interaction between SNPs) then the off-diagonal terms of S are zero and the log likelihood becomes (ignoring constant terms) log P(x|Ci) = Σj [ -log(sij) - (xj - mij)²/(2 sij²) ], where mij and sij² are the mean and variance of the jth variable in class Ci.
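A brief sketch of this diagonal-covariance log likelihood with assumed per-feature means and variances; each feature contributes an independent one-dimensional Gaussian term.

import numpy as np

def naive_bayes_log_lik(x, means, variances):
    # Sum of per-feature Gaussian log densities, dropping the constant term
    return np.sum(-0.5 * np.log(variances) - (x - means) ** 2 / (2 * variances))

# Assumed per-class feature means and variances (as if estimated from training data)
means_c1 = np.array([1.2, 0.8, 1.5])
vars_c1 = np.array([0.5, 0.3, 0.6])

x = np.array([1.0, 2.0, 1.0])
print(naive_bayes_log_lik(x, means_c1, vars_c1))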

Nearest means classifier
If we further assume all the variances sj to be equal to a common s then (ignoring constant terms) we get log P(x|Ci) = -||x - mi||²/(2s²), so a datapoint is assigned to the class whose mean is nearest in Euclidean distance.

Gaussian classification example
Consider three SNP genotypes for case and control subjects.
Case (class C1): (1,2,0), (2,2,0), (2,2,0), (2,1,1), (0,2,1), (2,1,0)
Control (class C2): (0,1,2), (1,1,1), (1,0,2), (1,0,0), (0,0,2), (0,1,0)
Classify (1,2,1) and (0,0,1) with the nearest means classifier.
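A sketch of the nearest means classifier applied to this data: compute each class mean and assign a query point to the class whose mean is closest in Euclidean distance.

import numpy as np

case = np.array([[1, 2, 0], [2, 2, 0], [2, 2, 0], [2, 1, 1], [0, 2, 1], [2, 1, 0]])
control = np.array([[0, 1, 2], [1, 1, 1], [1, 0, 2], [1, 0, 0], [0, 0, 2], [0, 1, 0]])

m_case = case.mean(axis=0)        # class C1 mean, roughly (1.50, 1.67, 0.33)
m_control = control.mean(axis=0)  # class C2 mean, roughly (0.50, 0.50, 1.17)

for x in (np.array([1, 2, 1]), np.array([0, 0, 1])):
    d_case = np.linalg.norm(x - m_case)
    d_control = np.linalg.norm(x - m_control)
    print(x, "case" if d_case < d_control else "control")
# (1,2,1) falls closer to the case mean, (0,0,1) closer to the control mean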