Learning Classifiers For Non-IID Data
Balaji Krishnapuram, Computer-Aided Diagnosis and Therapy, Siemens Medical Solutions, Inc.
Collaborators: Volkan Vural, Jennifer Dy [Northeastern], Ya Xue [Duke], Murat Dundar, Glenn Fung, Bharat Rao [Siemens]
Jun 27, 2006
Outline
- Implicit IID assumption in traditional classifier design
  - Often not valid in real life
- Motivating CAD problems
- Convex algorithms for Multiple Instance Learning (MIL)
- Bayesian algorithms for batch-wise classification
- Faster, approximate algorithms via mathematical programming
- Summary / Conclusions
IID assumption in classifier design
- Training data D = {(x_i, y_i)}_{i=1}^N, x_i ∈ R^d, y_i ∈ {+1, -1}
- Testing data T = {(x_i, y_i)}_{i=1}^M, x_i ∈ R^d, y_i ∈ {+1, -1}
- Assume each training/testing sample is drawn independently from an identical distribution: (x_i, y_i) ~ P_XY(x, y)
- This is why we can classify one test sample at a time, ignoring the features of the other test samples
- E.g. logistic regression: P(y_i = 1 | x_i, w) = 1 / (1 + exp(-w^T x_i))
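To make the implicit independence concrete, below is a minimal sketch (hypothetical numpy code, not from the talk) of logistic-regression prediction: each test sample is scored using only its own features, so the other rows of the test set never enter the computation.

```python
import numpy as np

def predict_proba(X_test, w):
    """Logistic-regression posterior P(y=1 | x, w) for each row of X_test.

    Each sample is scored independently: row i uses only x_i, which is
    exactly the IID assumption baked into traditional classifier design.
    """
    scores = X_test @ w                      # u_i = w^T x_i
    return 1.0 / (1.0 + np.exp(-scores))     # P(y_i = 1 | x_i, w)

# Toy usage: three 2-D test samples scored with a fixed weight vector.
w = np.array([1.5, -0.5])
X_test = np.array([[1.0, 0.0], [0.2, 0.8], [-1.0, 1.0]])
print(predict_proba(X_test, w))
```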
Evaluating classifiers: learning theory
- Binomial test-set bounds: with high probability over the random draw of the M samples in the test set T, if M is large and a classifier w is observed to be accurate on T, then its expected accuracy on a random draw from P_XY(x, y) is also high
- If the IID assumption fails, all bets are off!
- Thought experiment: repeat the same test sample M times
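To show what such a bound computes, here is a small sketch (my own; it uses the standard one-sided Clopper-Pearson inversion of the binomial tail, which may differ in detail from the bound used in the talk): given k errors on M iid test samples, it returns an upper confidence bound on the true error rate. The guarantee evaporates if the M samples are not independent draws, as in the repeated-sample thought experiment.

```python
from scipy.stats import beta

def binomial_test_set_bound(k_errors, M, delta=0.05):
    """Upper bound on the true error rate that holds with probability >= 1 - delta
    over the random draw of M iid test samples (one-sided Clopper-Pearson)."""
    if k_errors >= M:
        return 1.0
    return beta.ppf(1.0 - delta, k_errors + 1, M - k_errors)

# 5 errors on 1000 iid test samples -> true error rate below roughly 1% with 95% confidence.
print(binomial_test_set_bound(5, 1000))
# The same 5 errors on "1000 samples" that are really one sample repeated
# 1000 times carry no such guarantee: the effective M is 1, not 1000.
```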
Training classifiers: learning theory
- With high probability over the random draw of the N samples in the training set D, the expected accuracy of the learnt classifier w on a random sample from P_XY(x, y) is high iff:
  - w is accurate on the training set D;
  - N is large; and
  - w satisfies intuitions held before seeing the data ("prior", large margin, etc.)
- PAC-Bayes, VC theory, etc. rely on the iid assumption
- Relaxation to exchangeability is being explored
CAD: Correlations among candidate ROIs (regions of interest)
Hierarchical Correlation Among Samples
Additive Random Effect Models
- The classification is treated as iid, but only when conditioned on both:
  - fixed effects (unique to each sample)
  - random effects (shared among samples)
- Simple additive model to explain the correlations:
  - P(y_i = 1 | x_i, w, r_i, v) = 1 / (1 + exp(-w^T x_i - v^T r_i))
  - P(y_i | x_i, w, r_i) = ∫ P(y_i | x_i, w, r_i, v) p(v | D) dv
- Sharing v^T r_i among many samples ⇒ correlated predictions
- ...but only small improvements in real-life applications
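A minimal sketch of the prediction rule (illustrative code added here, not the authors' implementation; the Gaussian form of p(v | D) and the Monte Carlo approximation of the integral are my assumptions): the fixed-effect term w^T x_i is unique to each sample, while the random-effect term v^T r_i is shared by all samples with the same random-effect covariates, which is what couples their predictions.

```python
import numpy as np

def random_effects_predict(X, R, w, v_mean, v_cov, n_draws=1000, seed=0):
    """P(y_i = 1 | x_i, r_i) = E_v[ sigmoid(w^T x_i + v^T r_i) ],
    with the expectation over p(v | D) approximated by Monte Carlo draws.

    X : (n, d) fixed-effect features, unique to each sample
    R : (n, k) random-effect covariates, shared among correlated samples
    """
    rng = np.random.default_rng(seed)
    V = rng.multivariate_normal(v_mean, v_cov, size=n_draws)    # draws of v
    logits = (X @ w)[:, None] + R @ V.T                         # (n, n_draws)
    return (1.0 / (1.0 + np.exp(-logits))).mean(axis=1)         # average over v

# Toy usage: two candidates from the same patient share the same r_i,
# so the uncertain random effect shifts their predictions together.
w = np.array([1.0, -1.0])
X = np.array([[0.5, 0.1], [0.4, 0.2]])
R = np.array([[1.0], [1.0]])                 # identical random-effect covariate
print(random_effects_predict(X, R, w, v_mean=np.zeros(1), v_cov=np.eye(1)))
```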
CAD detects early stage colon cancer
Candidate-Specific Random Effects Model: Polyps (ROC plot: sensitivity vs. specificity)
CAD algorithms: domain-specific issues
- Multiple (correlated) views: one detection is sufficient
- Systemic treatment of diseases: e.g. detecting one pulmonary embolism (PE) is sufficient
- Modeling the data-acquisition mechanism
- Errors in guessing class labels for the training set
The Multiple Instance Learning (MIL) Problem
- A bag is a collection of many instances (samples)
- The class label is provided for bags, not instances
- A positive bag has at least one positive instance in it
- Examples of "bag" definitions for CAD applications:
  - Bag = samples from multiple views of the same region
  - Bag = all candidates referring to the same underlying structure
  - Bag = all candidates from a patient
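To make the bag semantics concrete, here is a small sketch (hypothetical code, not from the talk) of an MIL data layout and of how a bag-level label relates to instance-level predictions: a bag is predicted positive if at least one of its instances is predicted positive.

```python
import numpy as np

# Each bag is an array of instance feature vectors; the label belongs to the bag.
# E.g. a "bag" could be all candidates a CAD system generates for one patient.
bags = [
    np.array([[0.2, 1.1], [0.3, 0.9]]),               # bag 0
    np.array([[1.5, -0.2], [0.1, 0.1], [1.2, 0.4]]),  # bag 1
]
bag_labels = np.array([-1, +1])   # +1: at least one instance is truly positive

def bag_prediction(instances, w, b=0.0):
    """A bag is predicted positive iff at least one instance scores positive."""
    scores = instances @ w + b
    return +1 if np.any(scores > 0) else -1

w = np.array([1.0, -1.0])
print([bag_prediction(B, w) for B in bags])   # one detection per bag is sufficient
```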
CH-MIL Algorithm: 2-D illustration
CH-MIL Algorithm for Fisher's Discriminant
- Easy implementation via alternating optimization
- Scales well to very large datasets
- Convex problem with a unique optimum
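For intuition, here is a simplified alternating-optimization sketch (my own illustration: it represents each positive bag by a point in the convex hull of its instances and uses a greedy vertex update, whereas the CH-MIL in the talk solves a constrained convex program with a unique optimum):

```python
import numpy as np

def fisher_direction(X_pos, X_neg, reg=1e-3):
    """Fisher's discriminant direction w = (S_W + reg*I)^{-1} (m_+ - m_-)."""
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    S_w = np.cov(X_pos, rowvar=False) * (len(X_pos) - 1) \
        + np.cov(X_neg, rowvar=False) * (len(X_neg) - 1)
    return np.linalg.solve(S_w + reg * np.eye(S_w.shape[0]), m_pos - m_neg)

def ch_mil_sketch(pos_bags, X_neg, n_iter=10):
    """Alternate between (1) fitting Fisher's discriminant on one representative
    point per positive bag and (2) moving each representative to the instance of
    its bag that the current discriminant scores highest (a hull vertex)."""
    reps = [B.mean(axis=0) for B in pos_bags]           # init: bag centroids
    w = None
    for _ in range(n_iter):
        w = fisher_direction(np.vstack(reps), X_neg)    # step 1: fit FD
        reps = [B[np.argmax(B @ w)] for B in pos_bags]  # step 2: update representatives
    return w

# Toy usage: two positive bags (each with one useful instance) vs. negatives.
pos_bags = [np.array([[2.0, 2.0], [0.0, 0.0]]),
            np.array([[1.8, 2.2], [0.2, -0.1]])]
X_neg = np.array([[0.0, 0.1], [-0.2, 0.0], [0.1, -0.3]])
print(ch_mil_sketch(pos_bags, X_neg))
```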
Lung CAD (*pending FDA approval): Lung Nodules & Pulmonary Emboli, for Computed Tomography and DR
CH-MIL: Pulmonary Embolisms
CH-MIL: Polyps in Colon
Classifying a Correlated Batch of Samples
- Let the classification of an individual sample x_i be based on a score u_i
  - E.g. linear: u_i = w^T x_i; or kernel predictor: u_i = ∑_{j=1}^N α_j k(x_i, x_j)
- Instead of basing the classification on u_i, we base it on an unobserved (latent) random variable z_i
- Prior: even before observing any features x_i (and thus before u_i), the z_i are known to be correlated a priori: p(z) = N(z | 0, Σ)
  - E.g. due to spatial adjacency: Σ_ij = exp(-D_ij), where D is the matrix of pair-wise distances between samples
Classifying a Correlated Batch of Samples
- Prior: even before observing any features x_i (and thus before u_i), the z_i are known to be correlated a priori: p(z) = N(z | 0, Σ)
- Likelihood: claim that u_i is really a noisy observation of the random variable z_i: p(u_i | z_i) = N(u_i | z_i, σ²)
- Posterior: remains correlated, even after observing the features x_i:
  P(z | u) = N(z | (σ² Σ^{-1} + I)^{-1} u, (Σ^{-1} + σ^{-2} I)^{-1})
- Intuition: E[z_i] = ∑_{j=1}^N A_ij u_j, where A = (σ² Σ^{-1} + I)^{-1}
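A small sketch of this batch posterior (illustrative numpy code added here; the choice σ² = 0.5 and the distance-based Σ are assumptions for the example): the vector of individual scores u is smoothed by the matrix A = (σ² Σ^{-1} + I)^{-1}, so each sample's latent score borrows strength from its correlated neighbours.

```python
import numpy as np
from scipy.spatial.distance import cdist

def batch_posterior(u, coords, sigma2=0.5):
    """Posterior over latent scores z given individual scores u.

    Prior:       z ~ N(0, Sigma),  Sigma_ij = exp(-||coords_i - coords_j||)
    Likelihood:  u_i | z_i ~ N(z_i, sigma2)
    Posterior:   mean = (sigma2 * Sigma^{-1} + I)^{-1} u
                 cov  = (Sigma^{-1} + I / sigma2)^{-1}
    """
    n = len(u)
    Sigma = np.exp(-cdist(coords, coords))      # correlation from spatial adjacency
    Sigma_inv = np.linalg.inv(Sigma)
    mean = np.linalg.solve(sigma2 * Sigma_inv + np.eye(n), u)
    cov = np.linalg.inv(Sigma_inv + np.eye(n) / sigma2)
    return mean, cov

# Toy usage: two nearby candidates and one far-away candidate.
coords = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
u = np.array([2.0, -0.5, 1.0])          # individual scores, e.g. u_i = w^T x_i
mean, _ = batch_posterior(u, coords)
print(mean)                              # the first two scores are pulled together
```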
SVM-like Approximate Algorithm
- Intuition: classify using E[z_i] = ∑_{j=1}^N A_ij u_j, where A = (σ² Σ^{-1} + I)^{-1}
- What if we used A = (Σ + I) instead?
  - Reduces computation by avoiding the matrix inversion
  - Not principled, but a heuristic for speed
- Yields an SVM-like mathematical-programming algorithm
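The trade-off fits in a few lines (again a hypothetical sketch, reusing a toy Σ): the exact posterior mean requires inverting Σ, while the heuristic simply multiplies the score vector by Σ + I, i.e. adds to each candidate's own score a correlation-weighted sum of its neighbours' scores.

```python
import numpy as np

def exact_smoothing(u, Sigma, sigma2=0.5):
    """Exact posterior mean (sigma2 * Sigma^{-1} + I)^{-1} u: needs a matrix inverse."""
    n = len(u)
    return np.linalg.solve(sigma2 * np.linalg.inv(Sigma) + np.eye(n), u)

def heuristic_smoothing(u, Sigma):
    """Heuristic: take A = (Sigma + I), avoiding any inversion."""
    return (Sigma + np.eye(len(u))) @ u

Sigma = np.array([[1.0, 0.9, 0.0],
                  [0.9, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
u = np.array([2.0, -0.5, 1.0])
print(exact_smoothing(u, Sigma))      # O(N^3) inversion, but principled
print(heuristic_smoothing(u, Sigma))  # one matrix-vector product, SVM-friendly
```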
Detecting Polyps in Colon
Detecting Pulmonary Embolisms
Detecting Nodules in the Lung
Conclusions
- IID assumption is universal in ML
  - Often violated in real life, but ignored
  - Explicit modeling can substantially improve accuracy
- Described 3 models in this talk, utilizing varying levels of information:
  - Additive random effects models: weak correlation information
  - Multiple instance learning: stronger correlations enforced
  - Batch-wise classification models: explicit information
- Statistically significant improvement in accuracy
- Only starting to scratch the surface, lots to improve!