Mathematical Foundations of BME

580.704 Mathematical Foundations of BME
Reza Shadmehr
Linear and quadratic decision boundaries. Kernel estimates of density. Missing data.

Bayesian classification Suppose we wish to classify vector x as belonging to one of the classes {1, …, L}. We are given labeled data and need to form a classification function. By Bayes rule, the posterior probability of class l is

P(c_l \mid x) = \frac{p(x \mid c_l)\, P(c_l)}{p(x)}

where p(x | c_l) is the likelihood, P(c_l) is the prior, and p(x) is the marginal. Classify x into the class l that maximizes the posterior probability.

Classification when distributions have equal variance Suppose we wish to classify a person as male or female based on height. What we have: the class-conditional densities p(height | male) and p(height | female). What we want: the posterior probabilities P(male | height) and P(female | height). Assume equal prior probability of being male or female. [Figure: the class-conditional densities of height for females and males over 160 to 200 cm, and the resulting posterior probability.] Note that the two densities have equal variance.
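A minimal numerical sketch of this example (the means, standard deviation, and priors below are assumed values for illustration, not taken from the slides): compute the posterior P(female | height) from two equal-variance Gaussian class-conditional densities.

```python
import numpy as np
from scipy.stats import norm

# Assumed parameters for illustration: equal variance, equal priors
mu_f, mu_m, sigma = 165.0, 180.0, 7.0   # heights in cm
prior_f = prior_m = 0.5

def posterior_female(height):
    """P(female | height) by Bayes rule with Gaussian likelihoods."""
    lik_f = norm.pdf(height, mu_f, sigma)   # p(height | female)
    lik_m = norm.pdf(height, mu_m, sigma)   # p(height | male)
    marginal = lik_f * prior_f + lik_m * prior_m
    return lik_f * prior_f / marginal

for h in (160, 172.5, 185):
    print(h, posterior_female(h))
# With equal variances and equal priors the posterior crosses 0.5 exactly
# at the midpoint of the two means, (mu_f + mu_m) / 2 = 172.5.
```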

Classification when distributions have equal variance [Figure: the two equal-variance class-conditional densities over 160 to 200 cm, and the log posterior ratio as a function of height.]

Estimating the decision boundary between data of equal variance Suppose the distribution of the data in each class is a Gaussian. The decision boundary between any two classes is where the log of the ratio of their posterior probabilities is zero. If the data in each class have a Gaussian density with equal variance, then the boundary between any two classes is a line (see the derivation below).
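A short derivation for the one-dimensional, equal-variance case (a sketch; the symbols mu_1, mu_2, sigma, and the priors are generic Gaussian parameters, not values given on the slides):

```latex
% Log ratio of posteriors for two Gaussian classes with shared variance \sigma^2
\log \frac{P(c_1 \mid x)}{P(c_2 \mid x)}
  = -\frac{(x-\mu_1)^2}{2\sigma^2} + \frac{(x-\mu_2)^2}{2\sigma^2}
    + \log \frac{P(c_1)}{P(c_2)}
  = \frac{(\mu_1-\mu_2)\,x}{\sigma^2}
    - \frac{\mu_1^2-\mu_2^2}{2\sigma^2}
    + \log \frac{P(c_1)}{P(c_2)}
```

The quadratic terms in x cancel, so the log ratio is linear in x. Setting it to zero gives the boundary

x^* = \frac{\mu_1+\mu_2}{2} + \frac{\sigma^2}{\mu_1-\mu_2}\,\log\frac{P(c_2)}{P(c_1)},

which reduces to the midpoint of the two means when the priors are equal.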

Estimating the decision boundary from estimated densities From the data we can get an ML estimate of the Gaussian parameters of each class. [Figure: data from Class 1, Class 2, and Class 3 with the pairwise decision boundaries between them.] Each pairwise log ratio gives us a line, for a total of 3 lines. The winning class for each region is the class that has the largest numerator in the posterior probability ratio.
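A sketch of this procedure in code (the toy data and class means are made up for illustration): fit a Gaussian to each class by maximum likelihood with a shared covariance, then assign a point to the class with the largest p(x | c_l) P(c_l), which is the same as comparing the pairwise log ratios.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D data: three classes with the same covariance (assumed for illustration)
means_true = [np.array([0., 0.]), np.array([4., 0.]), np.array([2., 3.])]
X = [m + rng.normal(scale=1.0, size=(50, 2)) for m in means_true]

# ML estimates: per-class mean, pooled (shared) covariance
mu_hat = [x.mean(axis=0) for x in X]
centered = np.vstack([x - m for x, m in zip(X, mu_hat)])
cov_hat = centered.T @ centered / centered.shape[0]
cov_inv = np.linalg.inv(cov_hat)
priors = np.array([len(x) for x in X], dtype=float)
priors /= priors.sum()

def classify(x):
    """Pick the class with the largest log p(x | c_l) + log P(c_l)."""
    scores = [-0.5 * (x - m) @ cov_inv @ (x - m) + np.log(p)
              for m, p in zip(mu_hat, priors)]
    return int(np.argmax(scores))

print(classify(np.array([1.0, 0.5])))   # region of class 0
print(classify(np.array([3.0, 1.0])))   # region of class 1
```

Because the covariance is shared, the x^T cov_inv x term is common to all classes and drops out of every pairwise comparison, so each boundary is a line.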

Relationship between Bayesian classification and the Fisher discriminant If we have two classes, class -1 and class +1, then the Fisher discriminant places the decision boundary where its discriminant function equals 0. For the Bayesian classifier, under the assumption of equal variance, the decision boundary is where the log posterior ratio equals 0. The Fisher decision boundary is the same as the Bayesian one when the two classes have equal variance and equal prior probability.
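A hedged sketch of why the two boundaries coincide, written in the standard multivariate form with a shared covariance Sigma (a generalization of the one-dimensional case on the slides; the notation is mine): the Bayes log posterior ratio is linear in x,

```latex
\log\frac{P(c_{+}\mid x)}{P(c_{-}\mid x)}
  = (\mu_{+}-\mu_{-})^{\top}\Sigma^{-1}x
    - \tfrac{1}{2}\,(\mu_{+}+\mu_{-})^{\top}\Sigma^{-1}(\mu_{+}-\mu_{-})
    + \log\frac{P(c_{+})}{P(c_{-})}
```

so its weight vector is \Sigma^{-1}(\mu_{+}-\mu_{-}), which is the direction the Fisher discriminant selects. With equal priors the thresholds also agree, so the two boundaries are identical.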

Classification when distributions have unequal variance What we have: class-conditional densities p(height | male) and p(height | female) whose variances differ. Classification: choose the class with the larger posterior probability. Assume: equal prior probability for the two classes. [Figure: the two unequal-variance densities over 140 to 200 cm and the resulting posterior probability, plotted from 0 to 1.]
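For the one-dimensional case with unequal variances the quadratic terms no longer cancel (a sketch with generic parameters, not values from the slides):

```latex
\log \frac{P(c_1 \mid x)}{P(c_2 \mid x)}
  = -\frac{(x-\mu_1)^2}{2\sigma_1^2}
    + \frac{(x-\mu_2)^2}{2\sigma_2^2}
    + \log\frac{\sigma_2}{\sigma_1}
    + \log\frac{P(c_1)}{P(c_2)}
```

The coefficient of x^2 is \tfrac{1}{2}\left(\tfrac{1}{\sigma_2^2}-\tfrac{1}{\sigma_1^2}\right), which is nonzero when the variances differ, so setting the log ratio to zero gives a quadratic equation in x and the boundary can consist of two points rather than one.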

[Figure-only slide: the two unequal-variance densities over 160 to 200 cm and the corresponding log posterior ratio as a function of height.]

Quadratic discriminant: when data come from unequal-variance Gaussians [Figure: data from the green, red, and blue classes separated by curved decision boundaries.] The decision boundary between any two classes is where the log of the ratio of their posterior probabilities is zero. If the data in each class have a Gaussian density with unequal variance, then the boundary between any two classes is a quadratic function of x.
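A minimal quadratic-discriminant sketch (the data and parameters are assumed for illustration): fit a separate mean and covariance to each class, then score a point with the full Gaussian log-likelihood plus log prior, so the covariance-dependent terms no longer cancel between classes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-D data: two classes with different covariances (assumed for illustration)
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=100)
X2 = rng.multivariate_normal([3, 0], [[4.0, 1.0], [1.0, 2.0]], size=100)

def fit_gaussian(X):
    return X.mean(axis=0), np.cov(X, rowvar=False)

params = [fit_gaussian(X) for X in (X1, X2)]
priors = [0.5, 0.5]

def quad_discriminant(x, mu, cov, prior):
    """g_i(x) = -1/2 (x-mu)^T cov^{-1} (x-mu) - 1/2 log|cov| + log prior."""
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(cov, diff)
            - 0.5 * np.log(np.linalg.det(cov))
            + np.log(prior))

def classify(x):
    scores = [quad_discriminant(x, mu, cov, p)
              for (mu, cov), p in zip(params, priors)]
    return int(np.argmax(scores))

print(classify(np.array([0.5, 0.0])))   # expected: class 0
print(classify(np.array([4.0, 1.0])))   # expected: class 1
```

Setting g_1(x) = g_2(x) gives a boundary that is quadratic in x because the two quadratic forms use different inverse covariances.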

Non-parametric estimate of densities: kernel density estimate Suppose we have points x(i) that belong to class l, and we cannot assume that these points come from a Gaussian distribution. To estimate the density, we need to form a function that assigns a weight to each point x in our space, with the integral of this function equal to 1. The more data points x(i) we find around x, the greater the weight of x should be. The kernel density estimate puts a Gaussian centered at each data point: where there are more data points, there are more Gaussians, and their normalized sum is the density estimate. [Figure: (left) histogram of the sampled data belonging to class l; (middle) ML estimate of a Gaussian density; (right) density estimate using a Gaussian kernel.]
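A small sketch of a Gaussian kernel density estimate (the bimodal sample and the bandwidth are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed non-Gaussian (bimodal) sample for class l
samples = np.concatenate([rng.normal(-8, 2, 60), rng.normal(6, 3, 40)])

def kde(x, samples, bandwidth=1.5):
    """Average of Gaussian kernels centered at each sample point."""
    z = (x[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (np.sqrt(2 * np.pi) * bandwidth)
    return kernels.mean(axis=1)

grid = np.linspace(-20, 20, 401)
density = kde(grid, samples)
# Riemann-sum check: the estimate integrates to approximately 1
print(density.sum() * (grid[1] - grid[0]))
```

Where the histogram is tall, many kernels overlap and the estimate is high; a Gaussian fit by ML would instead smear a single bump over both modes.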

Non-parametric estimate of densities: kernel density estimate [Figure: kernel density estimates for the green, red, and blue classes.]

Classification with missing data Suppose that we have built a Bayesian classifier and are now given a new data point to classify, but this new data point is missing some of the "features" that we normally expect to see. In the example below, we have two features (x1 and x2) and four classes; the likelihood functions are plotted. [Figure: contours of the likelihood functions of the four classes in the (x1, x2) plane.] Suppose that we are given the data point (*, -1) to classify. This data point is missing a value for x1. If we assume the missing value is the average of the previously observed x1, then we would estimate it to be about 1. Assuming that the prior probabilities are equal among the four classes, we would classify (1, -1) as class c2. However, c4 is a better choice, because when x2 = -1, c4 has the highest likelihood once the missing x1 is marginalized out, making it the most likely class.
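A numerical sketch of this idea (the class means, covariances, and priors below are invented for illustration; they are not the four classes plotted on the slide): when x1 is missing, compare the classes using the marginal likelihood p(x2 | c_i) instead of plugging in a guessed value for x1.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Hypothetical parameters for four classes over features (x1, x2)
means = [np.array([-2.0, 1.0]), np.array([1.0, 1.0]),
         np.array([4.0, 2.0]), np.array([6.0, -1.0])]
covs = [np.eye(2)] * 4
priors = [0.25] * 4

x2_obs = -1.0     # observed feature
x1_guess = 1.0    # "fill in x1 with its average" strategy

# Strategy 1: impute x1 and use the joint likelihood
joint = [multivariate_normal(m, c).pdf([x1_guess, x2_obs])
         for m, c in zip(means, covs)]

# Strategy 2: marginalize x1 out; for a Gaussian this is just the
# 1-D density of x2 under each class
marginal = [norm(m[1], np.sqrt(c[1, 1])).pdf(x2_obs)
            for m, c in zip(means, covs)]

# With these made-up parameters the two strategies disagree,
# mirroring the slide's point about (1, -1) vs. the marginal answer.
print("imputed x1   ->", int(np.argmax(joint)))
print("marginalized ->", int(np.argmax(marginal)))
```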

Classification with missing data Write the feature vector as x = (x_g, x_b), where x_g is the good (observed) data and x_b is the bad (or missing) data.
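The equations on this slide are not legible in the transcript; the standard form of this marginalization (following the treatment in Duda, Hart, and Stork, whose notation of good and bad data this slide uses) is:

```latex
P(c_i \mid x_g)
  = \frac{p(c_i, x_g)}{p(x_g)}
  = \frac{\int p(c_i, x_g, x_b)\, dx_b}{p(x_g)}
  = \frac{\int p(x_g, x_b \mid c_i)\, P(c_i)\, dx_b}{\int p(x_g, x_b)\, dx_b}
```

Classify with the class that maximizes this posterior: the likelihood is integrated over the missing features rather than evaluated at a single guessed value.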