Empirical risk minimization
Usman Roshan

Supervised learning for two classes
We are given n training samples (x_i, y_i), i = 1, ..., n, drawn i.i.d. from a probability distribution P(x, y). Each x_i is a d-dimensional vector (x_i in R^d) and y_i is +1 or -1. Our problem is to learn a function f(x) for predicting the labels of test samples x_i' in R^d, i = 1, ..., n', also drawn i.i.d. from P(x, y).

Loss function
The loss function c(x, y, f(x)) maps to [0, inf). Some common examples are given below.
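A few typical choices, written for the two-class setup above (these are standard textbook losses and may not match the slide's exact examples):

0-1 loss:      c(x, y, f(x)) = \mathbf{1}[\operatorname{sign}(f(x)) \neq y]
squared loss:  c(x, y, f(x)) = (y - f(x))^2
hinge loss:    c(x, y, f(x)) = \max(0,\ 1 - y f(x))
logistic loss: c(x, y, f(x)) = \log(1 + e^{-y f(x)})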

Test error
We quantify the test error as the expected error on the test set, in other words the average test error. In the case of two classes it is the average loss over the test samples, written below. We want to find the f that minimizes this, but we need P(y|x), which we don't have access to.
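Using the generic loss c and the n' test samples defined earlier (a standard form; for the 0-1 loss it is simply the fraction of misclassified test points):

\mathrm{Err}_{\mathrm{test}}(f) = \frac{1}{n'} \sum_{i=1}^{n'} c\bigl(x_i', y_i', f(x_i')\bigr)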

Expected risk
Suppose we don't have test data (x'). Then we average the test error over all possible data points x. This quantity is known as the expected risk, or the expected value of the loss function in Bayesian decision theory. We want to find the f that minimizes it, but we don't have all data points, only training data, and we don't know P(x, y).
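In symbols, the expected risk averages the loss over the whole distribution:

R(f) = \int c(x, y, f(x))\, dP(x, y) = \mathbb{E}_{(x, y) \sim P}\bigl[c(x, y, f(x))\bigr]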

Empirical risk
Since we only have training data we can't calculate the expected risk (we don't even know P(x, y)). Solution: approximate P(x, y) with the empirical distribution p_emp(x, y), built from the delta function δ_x(y) = 1 if x = y and 0 otherwise.
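Using the delta function just defined, the empirical distribution places mass 1/n on each training pair:

p_{\mathrm{emp}}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x)\, \delta_{y_i}(y)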

Empirical risk
We can now define the empirical risk as the average loss over the training samples. Once the loss function is defined and the training data is given, we can find the f that minimizes it.
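Substituting p_emp for P in the expected risk gives the standard definition:

R_{\mathrm{emp}}(f) = \frac{1}{n} \sum_{i=1}^{n} c\bigl(x_i, y_i, f(x_i)\bigr)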

Example of minimizing empirical risk (least squares)
Suppose we are given n data points (x_i, y_i), where each x_i is in R^d and y_i is in R. We want to determine a linear function f(x) = a^T x + b for predicting test points, using the loss function c(x_i, y_i, f(x_i)) = (y_i - f(x_i))^2. What is the empirical risk?
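Plugging the squared loss and f(x) = a^T x + b into the definition of the empirical risk gives the familiar least-squares objective:

R_{\mathrm{emp}}(a, b) = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - (a^T x_i + b)\bigr)^2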

Empirical risk for least squares
Finding f has now reduced to finding a and b. Since this function is convex in a and b, it has a global optimum, which is easy to find by setting the first derivatives to 0.
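As a concrete sketch (not from the slides; the function name fit_least_squares and the synthetic data are made up for illustration), setting the derivatives to zero leads to the normal equations, which NumPy can solve directly:

import numpy as np

# Minimal sketch: minimize R_emp(a, b) = (1/n) * sum_i (y_i - (a^T x_i + b))^2.
# The objective is convex in (a, b), so the zero-derivative point is the global
# optimum; np.linalg.lstsq solves the resulting normal equations stably.
def fit_least_squares(X, y):
    """X: (n, d) training matrix, y: (n,) real-valued targets. Returns (a, b)."""
    n = X.shape[0]
    Xb = np.hstack([X, np.ones((n, 1))])        # constant column for the intercept b
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # least-squares solution of Xb @ w ~ y
    return w[:-1], w[-1]                        # a = weight vector, b = intercept

# Usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.01 * rng.normal(size=200)
a, b = fit_least_squares(X, y)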

Maximum likelihood and empirical risk
Maximizing the likelihood P(D|M) is the same as maximizing log(P(D|M)), which is the same as minimizing -log(P(D|M)). If we set the loss function to the per-sample negative log-likelihood, then minimizing the empirical risk is the same as maximizing the likelihood.
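In symbols: with i.i.d. samples the likelihood factorizes over the data, so choosing the per-sample loss as the negative log-likelihood makes the two problems coincide (up to the constant factor 1/n):

c(x_i, y_i, f(x_i)) = -\log P(x_i, y_i \mid M)
\quad\Longrightarrow\quad
R_{\mathrm{emp}} = \frac{1}{n} \sum_{i=1}^{n} -\log P(x_i, y_i \mid M) = -\frac{1}{n} \log P(D \mid M)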

Empirical risk
We pose the empirical risk in terms of a loss function and then solve it to get our classifier.
Input: n training samples x_i, each of dimension d, along with labels y_i.
Output: a linear function f(x) = w^T x + w_0 that minimizes the empirical risk.

Empirical risk examples
Linear regression (the least-squares risk above) is one example. How about logistic regression?

Logistic regression
In the logistic regression model, let y = +1 be case and y = -1 be control. The sample likelihood of the training data is the product of the per-sample conditional probabilities, shown below.
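A standard way to write the model and the likelihood for labels in {-1, +1}, using the parameters w and w_0 from the later slides:

P(y = +1 \mid x) = \frac{1}{1 + e^{-(w^T x + w_0)}},
\qquad
P(y \mid x) = \frac{1}{1 + e^{-y (w^T x + w_0)}}

L(w, w_0) = \prod_{i=1}^{n} P(y_i \mid x_i) = \prod_{i=1}^{n} \frac{1}{1 + e^{-y_i (w^T x_i + w_0)}}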

Logistic regression
We find our parameters w and w_0 by maximizing the likelihood, or equivalently minimizing the negative log-likelihood, which is given below.
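Taking the negative log of the likelihood above:

-\log L(w, w_0) = \sum_{i=1}^{n} \log\bigl(1 + e^{-y_i (w^T x_i + w_0)}\bigr)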

Logistic regression empirical risk
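Dividing the negative log-likelihood by n gives the empirical risk for logistic regression:

R_{\mathrm{emp}}(w, w_0) = \frac{1}{n} \sum_{i=1}^{n} \log\bigl(1 + e^{-y_i (w^T x_i + w_0)}\bigr)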

Hinge loss empirical risk
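Replacing the logistic loss with the hinge loss max(0, 1 - y f(x)), the loss used by support vector machines, gives:

R_{\mathrm{emp}}(w, w_0) = \frac{1}{n} \sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i (w^T x_i + w_0)\bigr)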

Different empirical risks
The same linear model f(x) = w^T x + w_0 gives different empirical risks depending on the loss: linear regression (squared loss), logistic regression (logistic loss), and the hinge loss.
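A small Python sketch (illustrative only; the function name empirical_risks and the data variables are not from the slides) that evaluates the three empirical risks for a given linear model, with labels y in {-1, +1} for the logistic and hinge cases:

import numpy as np

def empirical_risks(X, y, w, w0):
    """X: (n, d) data, y: (n,) labels/targets, (w, w0): linear model parameters."""
    f = X @ w + w0
    squared  = np.mean((y - f) ** 2)                   # linear regression: (y - f(x))^2
    logistic = np.mean(np.logaddexp(0.0, -y * f))      # logistic loss: log(1 + e^{-y f(x)})
    hinge    = np.mean(np.maximum(0.0, 1.0 - y * f))   # hinge loss: max(0, 1 - y f(x))
    return squared, logistic, hinge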

Does it make sense to optimize empirical risk?
Does R_emp(f) approach R(f) as we increase the sample size? Remember that f is our classifier: we are asking whether the empirical error of f approaches the expected error of f as we see more samples. Yes, by the law of large numbers: the mean of a random sample approaches the true mean as the sample size increases. But how fast does it converge?

Chernoff bounds
Suppose we have i.i.d. trials X_i, where each X_i = |f(x_i) - y_i|, and let m be the true mean of X. Then the probability that the sample mean deviates from m by more than ε is exponentially small, as stated below.
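One common form of the bound, stated for i.i.d. variables bounded in [0, 1] (the exact constants differ slightly between references):

P\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i - m \right| \ge \epsilon \right) \le 2\, e^{-2 n \epsilon^2}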

Convergence issues
Applying the Chernoff bound to the empirical and expected risk gives us the bound below. But remember that here we fix f first, before looking at the data, so this is not too helpful: we want a bound that also covers the function estimated from the data, i.e., one that holds uniformly over the function class.
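For a single, fixed f with the bounded loss above, this reads:

P\bigl( |R_{\mathrm{emp}}(f) - R(f)| \ge \epsilon \bigr) \le 2\, e^{-2 n \epsilon^2}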

Bound on empirical risk minimization
In other words, we want to bound the worst-case gap sup_{f in F} |R(f) - R_emp(f)| over the function class F. With some work we can show the bound below, where N(F, 2n) measures the size of the function space F: it is the maximum size of F restricted to 2n data points. Since there are at most 2^(2n) distinct binary classifiers on 2n points, the maximum value of N(F, 2n) is 2^(2n).
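A standard statement of this uniform (Vapnik-Chervonenkis style) bound has the following shape, where c_1 and c_2 are constants whose exact values vary between references:

P\left( \sup_{f \in F} \bigl| R(f) - R_{\mathrm{emp}}(f) \bigr| > \epsilon \right) \le c_1\, N(F, 2n)\, e^{-c_2 n \epsilon^2}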

Structural risk / regularized risk minimization
We can rewrite the previous bound as a high-probability upper bound on the expected risk: the empirical risk plus a term that grows with the size of the function space. Compare this to regularized risk minimization, which minimizes the empirical risk plus a penalty on the complexity of f.
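Inverting the uniform bound (solving for ε at confidence level 1 - δ) gives, for some constant c, a statement of the following shape; the point of the comparison is that both objectives trade training error off against a complexity term:

With probability at least 1 - δ:
R(f) \le R_{\mathrm{emp}}(f) + \sqrt{\frac{\log N(F, 2n) + \log(1/\delta)}{c\, n}}

Regularized risk minimization:
\min_{f}\ R_{\mathrm{emp}}(f) + \lambda\, \Omega(f), \quad \text{e.g. } \Omega(f) = \lVert w \rVert^2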