Decision Theory Naïve Bayes ROC Curves

Presentation transcript:

Decision Theory Naïve Bayes ROC Curves

Generative vs Discriminative Methods
Logistic regression learns a mapping h: x → y. When we only learn a mapping x → y it is called a discriminative method. Generative methods learn p(x,y) = p(x|y) p(y), i.e. for every class we learn a model over the input distribution. Advantage: this acts as a regularizer for small datasets (but when N is large, discriminative methods tend to work better). We can also easily combine various sources of information: say we have learned a model for attribute I, and now receive additional information about attribute II; then p(xI, xII | y) = p(xI | y) p(xII | y). This factorization is called "conditional independence of x given y", and the corresponding classifier is called the "Naïve Bayes classifier". Disadvantage: you model more than is necessary for making decisions, and the input space (x-space) can be very high dimensional.
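To make the contrast concrete, here is a minimal sketch (not part of the original slides) that fits a generative model (Gaussian naïve Bayes, which estimates p(x|y) and p(y)) and a discriminative model (logistic regression, which estimates p(y|x) directly) on synthetic data, assuming scikit-learn is available:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data; the dataset sizes and seeds are arbitrary.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gen = GaussianNB().fit(X_tr, y_tr)            # generative: models p(x|y) and p(y)
disc = LogisticRegression().fit(X_tr, y_tr)   # discriminative: models p(y|x) directly

print("generative accuracy:    ", gen.score(X_te, y_te))
print("discriminative accuracy:", disc.score(X_te, y_te))
```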

Naïve Bayes: decisions
p(y | x1, ..., xd) = p(y) ∏_k p(xk | y) / Σ_y' p(y') ∏_k p(xk | y'). This is the "posterior distribution" and it can be used to make a decision on what label to assign to a new data-case. Note that to make a decision you do not need the denominator: it is the same for every y and only normalizes the posterior. If we computed the posterior p(y | xI) first, we can use it as a new prior for the new information xII (prove this at home): p(y | xI, xII) ∝ p(xII | y) p(y | xI).
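As a sanity check of that claim, here is a small numeric sketch with made-up probability tables (the priors, likelihoods, and observed attribute values are hypothetical), showing that updating with xI and xII jointly gives the same posterior as using p(y | xI) as the prior for xII:

```python
import numpy as np

prior = np.array([0.7, 0.3])                    # p(y=0), p(y=1)
p_xI_given_y = np.array([[0.9, 0.1],            # row y=0: p(xI=0|y), p(xI=1|y)
                         [0.4, 0.6]])           # row y=1
p_xII_given_y = np.array([[0.8, 0.2],
                          [0.3, 0.7]])

xI, xII = 1, 0                                  # observed attribute values

# Joint update: p(y | xI, xII) is proportional to p(y) p(xI|y) p(xII|y)
joint = prior * p_xI_given_y[:, xI] * p_xII_given_y[:, xII]
posterior_joint = joint / joint.sum()           # denominator only normalizes

# Sequential update: first p(y | xI), then use it as the prior for xII
step1 = prior * p_xI_given_y[:, xI]
step1 /= step1.sum()
step2 = step1 * p_xII_given_y[:, xII]
posterior_seq = step2 / step2.sum()

print(posterior_joint, posterior_seq)           # identical up to rounding
decision = posterior_joint.argmax()             # pick the most probable label
```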

Naïve Bayes: learning
What do we need to learn from data? p(y) and p(xk | y) for all k. A very simple rule is to look at the frequencies in the data (assuming discrete states):
p(y) = [ nr. data-cases with label y ] / [ total nr. data-cases ]
p(xk=i | y) = [ nr. data-cases in state xk=i with label y ] / [ nr. data-cases with label y ]
To regularize, we imagine that each state i starts out with a small fractional number c of data-cases (K = total nr. of states of xk):
p(xk=i | y) = [ c + nr. data-cases in state xk=i with label y ] / [ Kc + nr. data-cases with label y ]
What difficulties do you expect if we do not assume conditional independence? Does NB over-estimate or under-estimate the uncertainty of its predictions? Practical guideline: work in the log-domain, i.e. pick the label that maximizes log p(y) + Σ_k log p(xk | y), so that a product of many small probabilities does not underflow.
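A minimal sketch of these counting rules, assuming discrete attributes encoded as integers; the function names and the toy data below are made up for illustration:

```python
import numpy as np

def fit_naive_bayes(X, y, n_states, n_classes, c=1.0):
    """Estimate p(y) and p(xk=i | y) from discrete data by counting,
    with c fractional pseudo-counts per state (smoothing)."""
    n, d = X.shape
    prior = np.array([(y == cl).sum() / n for cl in range(n_classes)])
    cond = np.zeros((d, n_states, n_classes))   # cond[k, i, cl] = p(xk=i | y=cl)
    for cl in range(n_classes):
        Xc = X[y == cl]
        for k in range(d):
            counts = np.bincount(Xc[:, k], minlength=n_states)
            cond[k, :, cl] = (c + counts) / (n_states * c + len(Xc))
    return prior, cond

def predict(x, prior, cond):
    """Work in the log-domain to avoid underflow when d is large."""
    log_post = np.log(prior).copy()
    for k, i in enumerate(x):
        log_post += np.log(cond[k, i, :])
    return log_post.argmax()

# Toy data: 2 binary attributes, 2 classes.
X = np.array([[0, 1], [1, 1], [0, 0], [1, 0], [1, 1]])
y = np.array([0, 1, 0, 1, 1])
prior, cond = fit_naive_bayes(X, y, n_states=2, n_classes=2)
print(predict(np.array([1, 1]), prior, cond))
```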

Loss functions
What if it is much more costly to make an error when predicting y=1 than y=0? Example: y=1 means "patient has cancer", y=0 means "patient is healthy". Introduce the "expected loss function":
E[L] = Σ_j Σ_k ∫_{Rj} L_kj p(x, y=k) dx
i.e. the probability of predicting class j while the true class is k, weighted by the loss L_kj, where Rj is the region of x-space where an example is assigned to class j. An example loss matrix (true class in rows, predicted class in columns):
                  predict cancer   predict healthy
true cancer             0               1000
true healthy            1                  0

Decision surface
How shall we choose Rj? Solution: minimize E[L] over {Rj}. Take an arbitrary point "x", compute Σ_k L_kj p(y=k | x) for every j, and assign x to the j that minimizes it. Since we minimize for every "x" separately, the total integral is minimal. Places where the decision switches belong to the "decision surface". What matrix L corresponds to the decision rule on the "Naïve Bayes: decisions" slide, which uses the posterior?
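A minimal numeric sketch of this rule using the cancer/healthy loss matrix above (the posterior values are made up): for a given x we compute Σ_k L_kj p(y=k | x) for each candidate prediction j and pick the smallest.

```python
import numpy as np

# Loss matrix L[k, j]: true class k in rows, predicted class j in columns.
# Index 0 = healthy, 1 = cancer; missing a cancer costs 1000, a false alarm costs 1.
L = np.array([[0,    1],    # true: healthy
              [1000, 0]])   # true: cancer

posterior = np.array([0.95, 0.05])   # p(y=healthy | x), p(y=cancer | x) for some x

# Expected loss of each prediction j: sum over k of L[k, j] * p(y=k | x)
exp_loss = posterior @ L
print(exp_loss)           # [50.0, 0.95] -> predicting "cancer" is cheaper
print(exp_loss.argmin())  # 1, even though cancer has only 5% posterior probability
```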

ROC Curve
Assume 2 classes and 1 attribute. Plot the class-conditional densities p(x | y=0) and p(x | y=1) over x [figure: two overlapping class-conditional density curves, labelled y=0 and y=1, along the x axis]. Shift the decision boundary from right to left. As you move it the expected loss changes, so you want to find the point where it is minimized. If L = [0 1; 1 0], where is the expected loss minimal? As you shift the boundary, the true positive rate (TP) and the false positive rate (FP) change. By plotting the entire curve you can see the trade-offs. This is easily generalized to more attributes if you can find a decision threshold to vary.

Evaluation: ROC curves
[figure: score distributions for class 0 (negatives) and class 1 (positives) with a moving decision threshold]
TP = true positive rate = # positives classified as positive / # positives
FP = false positive rate = # negatives classified as positive / # negatives
TN = true negative rate = # negatives classified as negative / # negatives
FN = false negative rate = # positives classified as negative / # positives
Identify a threshold in your classifier that you can shift, and plot the ROC curve while you shift that parameter.
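A minimal sketch of this procedure (the scores and labels are made up): for each threshold value we classify everything scoring at or above it as positive and record the resulting (FP, TP) pair.

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep a decision threshold over classifier scores and collect
    (FP rate, TP rate) pairs; labels are 0 (negative) / 1 (positive)."""
    points = []
    for t in np.sort(np.unique(scores))[::-1]:
        pred = (scores >= t).astype(int)
        tp = ((pred == 1) & (labels == 1)).sum() / (labels == 1).sum()
        fp = ((pred == 1) & (labels == 0)).sum() / (labels == 0).sum()
        points.append((fp, tp))
    return points

scores = np.array([0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1,   1,   0,   1,    0,   1,   0])
print(roc_points(scores, labels))
```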

Precision-Recall vs. ROC Curve