Loss.

Minimum Expected Loss/Risk
If we want to go beyond zero-one loss, we need to define a loss matrix with elements L_kj specifying the penalty for assigning a pattern that belongs to class C_k to class C_j (read the subscript kj as k → j, i.e. "k classified as j"). Example: classify medical images as 'cancer' or 'normal'. The loss matrix is laid out with the true class (Truth) indexing the rows and the assigned class (Decision) indexing the columns, and misclassifying cancer as normal would typically be penalized far more heavily than the reverse. To compute the minimum expected loss, we first need the concept of expected value.
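As a minimal sketch (the penalty values below are illustrative assumptions, not taken from the slides), the loss matrix for the cancer/normal example can be written as a 2×2 array, with rows indexed by the true class k and columns by the assigned class j:

```python
import numpy as np

# Loss matrix L[k, j]: penalty for deciding class j when the truth is class k.
# Class order: 0 = cancer, 1 = normal. The numbers are illustrative only:
# missing a cancer (L[0, 1]) is penalized far more than a false alarm (L[1, 0]).
L = np.array([
    [0.0, 100.0],   # truth = cancer: correct decision costs 0, a miss costs 100
    [1.0,   0.0],   # truth = normal: a false alarm costs 1, correct costs 0
])
```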

Expected Value
The expected value of a function f(x), where x has probability mass (discrete case) or probability density (continuous case) p(x), is
E[f] = Σ_x p(x) f(x)    (discrete)
E[f] = ∫ p(x) f(x) dx    (continuous)
For a finite set of data points x_1, ..., x_n drawn from the distribution p(x), the expectation can be approximated by the average over the data points:
E[f] ≈ (1/n) Σ_{i=1..n} f(x_i)
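A minimal sketch of that last approximation (the choices p(x) = N(0, 1) and f(x) = x² are arbitrary assumptions, made so the exact answer E[f] = 1 is known):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw n samples x_1, ..., x_n from p(x); here p(x) is the standard normal.
x = rng.standard_normal(100_000)

# Monte Carlo approximation: E[f] ~ (1/n) * sum_i f(x_i), with f(x) = x^2.
estimate = np.mean(x**2)
print(estimate)  # close to the exact value E[x^2] = 1
```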

Reminder: Minimum Misclassification Rate
(Figure: illustration with more general class-conditional distributions, showing the different error regions.)

Minimum Expected Loss/Risk
For two classes, the expected loss is
E[L] = ∫_R2 L_12 p(x, C_1) dx + ∫_R1 L_21 p(x, C_2) dx
i.e. the penalty L_12 is incurred when a point whose true class is C_1 falls in region R_2 (and is therefore classified as C_2), and similarly for L_21. In general, the decision regions R_j are chosen to minimize
E[L] = Σ_k Σ_j ∫_Rj L_kj p(x, C_k) dx
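A sketch of the decision rule this leads to, assuming the posteriors p(C_k|x) are already available (e.g. from a trained model): for each input, compute the expected loss R_j = Σ_k L_kj p(C_k|x) of every possible decision j and pick the smallest.

```python
import numpy as np

def min_expected_loss_decision(posteriors, L):
    """posteriors: shape (n_classes,), entries p(C_k | x);
    L: loss matrix, L[k, j] = penalty for deciding C_j when the truth is C_k.
    Returns the decision j with minimum expected loss."""
    # Expected loss of deciding j: R_j = sum_k L[k, j] * p(C_k | x)
    expected_losses = posteriors @ L
    return int(np.argmin(expected_losses))

# Illustrative numbers: the image is probably 'normal' (p = 0.7), but the
# asymmetric loss matrix still makes 'cancer' the lower-risk decision.
L = np.array([[0.0, 100.0],
              [1.0,   0.0]])
print(min_expected_loss_decision(np.array([0.3, 0.7]), L))  # -> 0 ('cancer')
```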

Reject Option
When the largest posterior probability p(C_k|x) is well below 1, the classifier is uncertain, and it may be better to make no automatic decision at all, e.g. handing an ambiguous medical image to a human expert. A simple rule is to reject the input whenever max_k p(C_k|x) falls below some threshold θ.
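A minimal sketch of this rule (the threshold value θ = 0.8 is an arbitrary assumption for illustration):

```python
import numpy as np

def classify_with_reject(posteriors, theta=0.8):
    """Return the index of the most probable class, or -1 (reject)
    if the largest posterior p(C_k | x) falls below the threshold theta."""
    k = int(np.argmax(posteriors))
    return k if posteriors[k] >= theta else -1

print(classify_with_reject(np.array([0.95, 0.05])))  # -> 0 (confident decision)
print(classify_with_reject(np.array([0.55, 0.45])))  # -> -1 (reject: too uncertain)
```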

Loss for Regression

Regression
For regression, the problem is a bit more complicated, and we also need the concept of conditional expectation. For a discrete target t:
E[t|x] = Σ_t t p(t|x)

Multivariable and Conditional Expectations
Remember the definition of the expectation of f(x), where x has probability p(x):
E[f] = Σ_x p(x) f(x) (discrete), or E[f] = ∫ p(x) f(x) dx (continuous)
The conditional expectation (discrete case) is
E[t|x] = Σ_t t p(t|x)
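A small sketch of the discrete conditional expectation (the distribution below is an arbitrary example): given the values t can take and their conditional probabilities p(t|x), E[t|x] is their probability-weighted sum.

```python
import numpy as np

# Possible target values and an illustrative conditional distribution p(t | x).
t_values = np.array([0.0, 1.0, 2.0])
p_t_given_x = np.array([0.2, 0.5, 0.3])  # must sum to 1

# E[t | x] = sum over t of t * p(t | x)
cond_expectation = np.sum(t_values * p_t_given_x)
print(cond_expectation)  # 0.2*0 + 0.5*1 + 0.3*2 = 1.1
```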

Decision Theory for Regression
Inference step: determine the joint density p(x, t) (or, directly, the posterior p(t|x)).
Decision step: for a given x, make an optimal prediction y(x).
Loss function: L(t, y(x)), with expected loss
E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt
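A sketch of the two steps on toy data (the linear-Gaussian model for p(t|x) is an assumption made purely for this illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: t = 2x + noise, so E[t | x] = 2x.
x = rng.uniform(-1.0, 1.0, 500)
t = 2.0 * x + rng.normal(0.0, 0.3, size=x.shape)

# Inference step: model p(t | x) as N(w*x + b, sigma^2) and fit w, b
# by least squares.
w, b = np.polyfit(x, t, deg=1)

# Decision step: under squared loss (see the next slides), the optimal
# prediction is the mean of p(t | x), so y(x) = w*x + b.
def y(x_new):
    return w * x_new + b

print(y(0.5))  # close to 2 * 0.5 = 1.0
```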

The Squared Loss Function
If we use the squared loss L(t, y(x)) = (y(x) − t)² as the loss function, the expected loss is
E[L] = ∫∫ (y(x) − t)² p(x, t) dx dt
ADVANCED: After some calculations (next slides...), we can show that this decomposes as
E[L] = ∫ (y(x) − E[t|x])² p(x) dx + ∫∫ (E[t|x] − t)² p(x, t) dx dt
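A quick numerical check of where this is heading (the gamma distribution stands in for some p(t|x) and is an arbitrary assumption): among all constant predictions y, the average squared loss over samples of t is minimized near the sample mean, i.e. near E[t|x].

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.gamma(shape=2.0, scale=1.5, size=10_000)  # samples from some p(t | x)

# Grid search over candidate predictions y and their average squared loss.
ys = np.linspace(0.0, 10.0, 1001)
avg_loss = np.array([np.mean((y - t) ** 2) for y in ys])

# The minimizer sits next to the sample mean, an estimate of E[t | x] = 3.0.
print(ys[np.argmin(avg_loss)], t.mean())
```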

ADVANCED - Explanation: The trick is to write y(x) − t = (y(x) − E[t|x]) + (E[t|x] − t) and expand the square. Consider the first term inside the loss:
∫∫ (y(x) − E[t|x])² p(x, t) dx dt
This is equal to
∫ (y(x) − E[t|x])² p(x) dx
since p(x, t) = p(t|x) p(x), and since (y(x) − E[t|x])² p(x) does not depend on t it can be moved out of the integral over t; the remaining integral ∫ p(t|x) dt amounts to 1, as we are summing the probabilities over all possible t.

Advanced: Explanation
Consider the second (cross) term inside the loss:
2 ∫∫ (y(x) − E[t|x]) (E[t|x] − t) p(x, t) dx dt
This is equal to zero: since (y(x) − E[t|x]) does not depend on t, it can be moved out of the integral over t, and the remaining integral ∫ (E[t|x] − t) p(t|x) dt vanishes (next slide).

ADVANCED: Explanation for last step
∫ (E[t|x] − t) p(t|x) dt = E[t|x] ∫ p(t|x) dt − ∫ t p(t|x) dt = E[t|x] − E[t|x] = 0
E[t|x] does not vary with different values of t, so it can be moved out of the integral. Notice that you could also see immediately that the expected deviation of the random variable t from its (conditional) mean is 0 (first line of the formula).

Important
Hence we have:
E[L] = ∫ (y(x) − E[t|x])² p(x) dx + ∫∫ (E[t|x] − t)² p(x, t) dx dt
The first term is minimized when we select y(x) = E[t|x]. The second term is independent of y(x) and represents the intrinsic variability of the target; it is called the intrinsic error.
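A numerical sanity check of this decomposition for a single x (distribution and prediction chosen arbitrarily): the expected squared loss splits into the squared distance of y(x) from the conditional mean plus the variance of t.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(2.0, 0.5, size=1_000_000)  # samples from some p(t | x)
y = 3.1                                   # an arbitrary prediction y(x)

lhs = np.mean((y - t) ** 2)               # E[(y(x) - t)^2]
rhs = (y - t.mean()) ** 2 + t.var()       # (y(x) - E[t|x])^2 + var[t|x]
print(lhs, rhs)                           # agree up to floating-point error
```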

Alternative approach/explanation
Using the squared error as the loss function,
E[L] = ∫∫ (y(x) − t)² p(x, t) dx dt
we want to choose y(x) to minimize the expected loss. Setting the functional derivative with respect to y(x) to zero gives
∂E[L]/∂y(x) = 2 ∫ (y(x) − t) p(x, t) dt = 0

Solving for y(x), we get:
y(x) = ∫ t p(x, t) dt / p(x) = ∫ t p(t|x) dt = E[t|x]
i.e. the optimal prediction under squared loss is the conditional mean of t given x.
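Putting it together on toy data (the local-averaging estimator below is a deliberately crude stand-in for a proper regression model): estimating E[t|x] by averaging t over samples whose x lies near the query point recovers the optimal squared-loss prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution: x is uniform and t = sin(pi*x) + noise,
# so the optimal prediction under squared loss is E[t | x] = sin(pi*x).
x = rng.uniform(-1.0, 1.0, 200_000)
t = np.sin(np.pi * x) + rng.normal(0.0, 0.2, size=x.shape)

def cond_mean(x_query, half_width=0.02):
    """Estimate E[t | x] by averaging t over samples with x near x_query."""
    mask = np.abs(x - x_query) < half_width
    return t[mask].mean()

print(cond_mean(0.5), np.sin(np.pi * 0.5))  # both close to 1.0
```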