A Posteriori Corrections to Classification Methods Włodzisław Duch & Łukasz Itert Department of Informatics, Nicholas Copernicus University, Torun, Poland.

Motivation So you've got your model... but that's not the end. Try to derive as much information from it as possible. In pathological cases neural networks (NN), fuzzy systems (FS), rough sets (RS) and other systems lead to results below the base rates. How can this be avoided? In the controlled experiment the class split was 50%-50%; in real life it is 5%-95%. How to deal with that? So your model is accurate; that doesn't impress me much. What about the costs? Confidence in the results? Sensitivity? Specificity? Can you improve it quickly? A posteriori corrections may help, and they are (almost) for free.

Corrections increasing accuracy K classes, with C_K the majority class. NN, kNN, fuzzy inference systems (FIS), RS and other systems do not estimate probabilities rigorously, but some estimates of p(C_i|X) are obtained; many systems do not optimize an error function at all. Idea: linear scaling of probabilities, p'(C_i|X) = lambda_i p(C_i|X), renormalized over classes. Setting lambda_i = 0 for all non-majority classes gives the majority classifier; lambda_i = 1 gives the original one. The lambda_i are then optimized.
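As a sketch of the idea (the function name and the example numbers are illustrative, not from the slides), the linear scaling with renormalization might look like:

```python
import numpy as np

def scale_probabilities(p, lam):
    """Linearly rescale class probabilities p(C_i|X) by factors lambda_i,
    then renormalize so they sum to 1 (illustrative sketch)."""
    p = np.asarray(p, dtype=float)
    scaled = lam * p
    return scaled / scaled.sum()

# lambda_i = 1 for all i reproduces the original classifier;
# lambda_i = 0 for all non-majority classes yields the majority classifier.
p = np.array([0.2, 0.3, 0.5])   # p(C_i|X) from some classifier; C_3 is the majority class
print(scale_probabilities(p, np.array([1.0, 1.0, 1.0])))  # unchanged: [0.2, 0.3, 0.5]
print(scale_probabilities(p, np.array([0.0, 0.0, 1.0])))  # all mass on C_3: [0, 0, 1]
```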

Softmax If lambda_i is restricted to [0,1], the probability of the majority class may only grow. Solution: allow lambda_i in [0, infinity), fix lambda_K = 1, and pass the rescaled values through softmax. Softmax flattens the probabilities; for 2 classes P(C|X) falls in [1/(1+e), 1/(1+e^-1)], approximately [0.27, 0.73].
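A minimal illustration of the flattening effect, assuming a standard (numerically stabilized) softmax:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Applying softmax directly to 2-class probabilities flattens them:
# even a completely certain prediction (1, 0) is squashed to about (0.73, 0.27),
# i.e. outputs stay within [1/(1+e), 1/(1+e**-1)].
print(softmax(np.array([1.0, 0.0])))  # ~[0.731, 0.269]
print(softmax(np.array([0.5, 0.5])))  # [0.5, 0.5]
```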

Cost function P_i(X) are the "true" probabilities if given; otherwise P_i(X) = 1 if the label of the training vector X is C_i, and P_i(X) = 0 otherwise. The quadratic cost is E = Sum_X Sum_i [p(C_i|X) - P_i(X)]^2. kNNs, Kohonen networks, decision trees, and many fuzzy and rough systems do not minimize such a cost function. Alternative: stacking with a linear perceptron.

Cost function with linear rescaling Substituting p'(C_i|X) = lambda_i p(C_i|X) gives E(lambda) = Sum_X Sum_i [lambda_i p(C_i|X) - P_i(X)]^2. Due to the normalization Sum_i p'(C_i|X) = 1, only K-1 of the lambda_i are independent.

Minimum of E(lambda): the solution Since E(lambda) is quadratic in the lambda_i, an elegant closed-form solution is found in the LMS (least mean squares) sense by solving a set of linear equations.
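Dropping the normalization constraint for simplicity, the LMS fit decouples over classes and admits a closed form; the sketch below is illustrative (function names and data are not from the slides):

```python
import numpy as np

def fit_lambda(P_model, Y):
    """Least-squares fit of per-class scaling factors lambda_i.

    P_model : (n_samples, n_classes) probabilities from a trained classifier
    Y       : (n_samples, n_classes) 1-of-K targets (the 'true' probabilities)

    Minimizing sum_X sum_i (lambda_i * p(C_i|X) - P_i(X))^2 decouples over
    classes, giving lambda_i = sum_X p_i P_i / sum_X p_i^2 for each class.
    (Sketch: the normalization constraint from the slides is omitted.)
    """
    P_model = np.asarray(P_model, dtype=float)
    Y = np.asarray(Y, dtype=float)
    num = (P_model * Y).sum(axis=0)    # sum_X p(C_i|X) P_i(X)
    den = (P_model ** 2).sum(axis=0)   # sum_X p(C_i|X)^2
    return num / den

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=200)   # fake classifier outputs
Y = np.eye(3)[P.argmax(axis=1)]           # pretend the labels match the argmax
print(fit_lambda(P, Y))                   # one scaling factor per class
```

A perfectly calibrated classifier (P equal to the targets) yields lambda_i = 1, i.e. no correction.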

Numerical example The primate splice-junction DNA gene sequences: 60 nucleotides; distinguish whether there is an intron => exon boundary, an exon => intron boundary, or neither. Vectors split into 2000 training cases and the rest for testing. kNN (k=11, Manhattan distance) gave the initial probabilities. Before correction: 85.8% (train), 85.7% (test). After correction: 86.4% (train), 86.9% (test), with lambda_1 = 1.0282, lambda_2 = 0.8785. MSE improvement: better probabilities, even if not always correct answers.

Changes in the a priori class distribution The a priori class distribution may differ between training and test data. If the data come from the same process, the class-conditional densities p(X|C_i) remain constant while the posteriors p(C_i|X) change. Bayes' theorem connects the training posteriors p_t(C_i|X) with the test posteriors: p(C_i|X) is proportional to p_t(C_i|X) p(C_i)/p_t(C_i), renormalized over the classes.
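The prior-shift correction takes only a few lines; the sketch below is illustrative (the 50-50% vs 5-95% numbers echo the motivation slide):

```python
import numpy as np

def adjust_posteriors(p_train_post, train_priors, new_priors):
    """Bayes correction when only the class priors change between
    training and deployment (class-conditional densities p(X|C_i)
    assumed identical):

        p(C_i|X)  proportional to  p_t(C_i|X) * p(C_i) / p_t(C_i)
    """
    w = np.asarray(new_priors, dtype=float) / np.asarray(train_priors, dtype=float)
    q = np.asarray(p_train_post, dtype=float) * w
    return q / q.sum()

# Trained on a balanced 50-50 sample, deployed where the positive class is 5%:
p = adjust_posteriors(np.array([0.6, 0.4]), [0.5, 0.5], [0.05, 0.95])
print(p)  # the positive posterior drops sharply, from 0.6 to about 0.073
```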

Estimation of a priori probabilities How to estimate the new p(C_i)? Estimate the confusion matrix p_t(C_i|C_j) on the training set (McLachlan and Basford 1988) and the predicted class frequencies p_test(C_i) by applying the classifier to the test data; then solve the linear equations Sum_j p_t(C_i|C_j) p(C_j) = p_test(C_i) for the unknown priors p(C_j). [Table: for the Diabetes, Breast cancer and Bupa Liver datasets, priors assumed 50-50%, priors recovered via p(C_i|C_j), and the true p(C_i).] Experiment: use an MLP on a small training sample.
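A hypothetical sketch of solving those linear equations; the confusion-matrix values here are made up, not taken from the slide's table:

```python
import numpy as np

def estimate_new_priors(conf_train, p_predicted):
    """Solve  sum_j p_t(C_i|C_j) * p(C_j) = p_pred(C_i)  for the unknown
    test-set priors p(C_j) (McLachlan & Basford style correction).

    conf_train  : matrix with entry [i, j] = p_t(predict C_i | true C_j),
                  estimated on the training set (columns sum to 1)
    p_predicted : fraction of test cases the classifier assigns to each class
    """
    priors = np.linalg.solve(conf_train, p_predicted)
    priors = np.clip(priors, 0, None)      # project back onto a valid distribution
    return priors / priors.sum()

A = np.array([[0.9, 0.2],   # 90% of true class 1 predicted as 1; 20% of class 2 misread as 1
              [0.1, 0.8]])
p_pred = A @ np.array([0.05, 0.95])        # classifier output frequencies on test data
print(estimate_new_priors(A, p_pred))      # recovers the true priors [0.05, 0.95]
```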

What to optimize? Overall accuracy is not always the most important quantity to optimize. Given a model M, the confusion matrix for a class C+ against all other classes (rows = true, columns = predicted by M) is:

             predicted +   predicted -
  true +         TP            FN
  true -         FP            TN

Quantities derived from p(C_i|C_j) Several quantities are used to evaluate classification models M created to distinguish the C+ class: sensitivity S+ = TP/(TP+FN), specificity S- = TN/(TN+FP), precision TP/(TP+FP), and the overall accuracy (TP+TN)/(TP+FN+FP+TN).
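These standard quantities follow directly from the 2x2 counts; a short sketch with illustrative numbers:

```python
def evaluate_binary(tp, fn, fp, tn):
    """Quality measures derived from a 2x2 confusion matrix
    (rows = true class, columns = predicted class)."""
    n = tp + fn + fp + tn
    return {
        "accuracy":    (tp + tn) / n,
        "sensitivity": tp / (tp + fn),   # S+, recall for the C+ class
        "specificity": tn / (tn + fp),   # S-
        "precision":   tp / (tp + fp),
    }

print(evaluate_binary(tp=40, fn=10, fp=5, tn=45))
# accuracy 0.85, sensitivity 0.8, specificity 0.9, precision ~0.889
```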

Error functions The best classifier is selected using Recall(Precision) curves or ROC curves, i.e. Sensitivity plotted against (1 - Specificity), S+(1 - S-). Confidence in M may be increased by rejecting some cases.
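Rejection of low-confidence cases can be sketched as follows (the threshold and the probabilities are illustrative):

```python
import numpy as np

def predict_with_rejection(probs, threshold=0.8):
    """Answer only the cases whose top posterior reaches the threshold;
    reject the rest. Confidence in the accepted answers grows as the
    threshold rises, at the cost of coverage. (Illustrative sketch.)"""
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accepted = confidence >= threshold
    return predictions, accepted

probs = np.array([[0.95, 0.05],
                  [0.55, 0.45],   # ambiguous case -> rejected
                  [0.10, 0.90]])
preds, ok = predict_with_rejection(probs, threshold=0.8)
print(preds[ok])  # answers given only for the two confident cases
```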

Errors and costs Optimization with explicit costs, weighted by a parameter gamma: for gamma = 0 this is equivalent to maximization of one limiting quality measure, and for large gamma to the maximization of another.
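One common way to make costs explicit, shown here as an illustrative sketch rather than the exact formula from the slide, is to weight each cell of the confusion matrix by a cost:

```python
import numpy as np

def expected_cost(conf, cost_matrix):
    """Expected misclassification cost for confusion counts conf[i, j]
    (true class i, predicted class j) under an explicit cost matrix.
    The gamma-weighted trade-off on the slide can be seen as a special
    case of varying the relative costs. (Illustrative sketch.)"""
    p = conf / conf.sum()            # joint probabilities p(C_i, predicted C_j)
    return (p * cost_matrix).sum()

conf = np.array([[40, 10],
                 [ 5, 45]])
costs = np.array([[0, 5],    # missing a C+ case costs 5
                  [1, 0]])   # a false alarm costs 1
print(expected_cost(conf, costs))  # 0.55
```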

Conclusions Applying a trained model in a real-world application does not end with classification; it may be only the beginning. Three types of corrections to optimize the final model have been considered: a posteriori corrections improving accuracy by scaling probabilities; restoring the balance between the training and test class distributions; and improving the confidence, sensitivity or specificity of the results. They are especially useful for the optimization of logical rules, and they may be combined: for example, a posteriori corrections may be applied to the accuracy for a chosen class (sensitivity), confidence, cost optimization, etc.