Multivariate Discriminants, Lectures 4.1 and 4.2. Harrison B. Prosper, Florida State University. INFN School of Statistics 2013, Vietri sul Mare (Salerno, Italy), 3 – 7 June 2013.

Outline. Lecture 4.1: Introduction; Classification – In Theory; Classification – In Practice. Lecture 4.2: Classification – In Practice (continued).

Introduction

Introduction – Multivariate Data. DØ 1995 top quark discovery, with x = (A, H_T).

Introduction – General Approaches. Two general approaches. Machine Learning: given training data T = (y, x) = {(y, x)_1, …, (y, x)_N}, a function space { f }, and a constraint on these functions, teach a machine to learn the mapping y = f(x). Bayesian Learning: given training data T, a function space { f }, the likelihood of the training data, and a prior defined on the space of functions, infer the mapping y = f(x).

Machine Learning. Choose: a function space F = { f(x, w) }, a constraint C, and a loss function* L. Method: find f(x, w*) by minimizing the empirical risk R(w) subject to the constraint C(w). (*The loss function measures the cost of choosing badly.)

Machine Learning. Many methods (e.g., neural networks, boosted decision trees, rule-based systems, random forests, …) use the quadratic loss and choose f(x, w*) by minimizing the constrained mean square empirical risk, shown below.
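The risk on the original slide is an image; a standard form, consistent with the notation above (with the constraint added as a penalty term of strength λ), is

\[
R(w) = \frac{1}{N} \sum_{i=1}^{N} \big[\, y_i - f(x_i, w) \,\big]^{2} + \lambda\, C(w) .
\]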

Bayesian Learning. Choose: a function space F = { f(x, w) }, the likelihood p(T | w) with T = (y, x), a loss function L, and a prior p(w). Method: use Bayes' theorem to assign a probability (density) to every function in the function space,

p(w | T) = p(T | w) p(w) / p(T)
         = p(y | x, w) p(x | w) p(w) / [ p(y | x) p(x) ]
         ~ p(y | x, w) p(w)   (assuming p(x | w) = p(x)).

Bayesian Learning. Given the posterior density p(w | T) and new data x, one computes the (predictive) distribution p(y | x, T). If a definite value of y is needed for every x, it can be obtained by minimizing the risk functional, which for L = (y – f)² approximates f(x) by the average of y over the predictive distribution; see the sketch below.
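The equations on the slide are images; standard expressions consistent with the text are

\[
p(y \mid x, T) = \int p(y \mid x, w)\, p(w \mid T)\, dw ,
\qquad
R[f] = \int\!\!\int L\big(y, f(x)\big)\, p(y \mid x, T)\, p(x)\, dy\, dx ,
\]
and, for L = (y - f)^2,
\[
f(x) = \int y\, p(y \mid x, T)\, dy .
\]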

Bayesian Learning. Suppose that y takes only the two values 0 and 1 for every x; the average above then reduces to f(x) = p(1 | x, T), where y = 1 is associated with objects to be kept and y = 0 with objects to be discarded. For example, in an e-mail filter we can reject junk e-mail using the (complement of the) rule "if p(1 | x, T) > q, accept x", which is called the Bayes classifier.

Classification In Theory

Classification: Theory. Optimality criterion: minimize the error rate α + β. Signal density: p(x, s) = p(x | s) p(s); background density: p(x, b) = p(x | b) p(b); x density: p(x). (The slide's figure shows the two densities overlapping, with a cut at x_0; α and β are the areas of the misclassified background and signal regions defined by the cut.)

Classification: Theory. The total loss L arising from classification errors is the sum of two terms, the cost of background misclassification and the cost of signal misclassification (see the sketch below), where f(x) = 0 defines a decision boundary such that f(x) > 0 defines the acceptance region, and H(f) is the Heaviside step function: H(f) = 1 if f > 0, 0 otherwise.
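The formula on the slide is an image; a standard form consistent with the description, with C_b and C_s the costs of background and signal misclassification, is

\[
L = C_b \int H\big(f(x)\big)\, p(x \mid b)\, p(b)\, dx
  + C_s \int \big[ 1 - H\big(f(x)\big) \big]\, p(x \mid s)\, p(s)\, dx .
\]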

Classification: Theory – 1-D example. Minimizing the total loss L with respect to the boundary x_0 leads to the result sketched below. The quantity in brackets is just the likelihood ratio. The result, in the context of hypothesis testing (with p(s) = p(b)), is called the Neyman–Pearson lemma (1933).
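The result on the slide is an image; setting dL/dx_0 = 0 for the loss written above gives

\[
\left[ \frac{p(x_0 \mid s)}{p(x_0 \mid b)} \right]
= \frac{C_b\, p(b)}{C_s\, p(s)} ,
\]
that is, the optimal boundary is a contour of the likelihood ratio.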

Classification: Theory. The ratio shown on the slide is called the Bayes discriminant because of its close connection to Bayes' theorem; see the sketch below.
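The formulas are images; one consistent reconstruction, writing the discriminant as the prior-weighted likelihood ratio, is

\[
B(x) = \frac{p(x \mid s)\, p(s)}{p(x \mid b)\, p(b)} ,
\qquad
p(s \mid x) = \frac{p(x \mid s)\, p(s)}{p(x \mid s)\, p(s) + p(x \mid b)\, p(b)}
            = \frac{B(x)}{1 + B(x)} ,
\]
so that B(x) and the Bayes probability p(s | x) are monotonically related.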

Classification: The Bayes Connection. Consider the mean squared risk in the limit N → ∞ (sketched below), where we have written p(y | x) = p(y, x) / p(x) and where we have assumed that the effect of the constraint is negligible in this limit.
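In this limit the sum over training events becomes an integral; the standard expression, consistent with the quadratic loss used earlier, is

\[
R[f] = \int\!\!\int \big[\, y - f(x) \,\big]^{2}\, p(y \mid x)\, p(x)\, dy\, dx .
\]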

Classification: The Bayes Connection. Now minimize the functional R[f] with respect to f. If the function f is sufficiently flexible, then R[f] will reach its absolute minimum, so for any small change δf in f the first-order variation of R vanishes (see below). If we require this to hold for all variations δf, for all x, then the term in brackets must be zero.
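The variation on the slide is an image; the standard calculation, starting from the risk above and using ∫ p(y | x) dy = 1, is

\[
\delta R = 2 \int \delta f(x) \left[ f(x) - \int y\, p(y \mid x)\, dy \right] p(x)\, dx = 0
\quad \Rightarrow \quad
f(x) = \int y\, p(y \mid x)\, dy .
\]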

Classification: The Bayes Connection. Since y = 1 for the signal class s and y = 0 for the background class b, we obtain the important result f(x) = p(s | x). (See Ruck et al., IEEE Trans. Neural Networks 4 (1990); Wan, IEEE Trans. Neural Networks 4 (1990); Richard and Lippmann, Neural Computation 3 (1991).) In summary: (1) given sufficient training data T and (2) a sufficiently flexible function f(x, w), then f(x, w) will approximate p(s | x) if y = 1 is assigned to objects of class s and y = 0 to objects of class b.

Classification: The Discriminant. In practice, we typically do not use p(s | x) directly, but rather the discriminant D(x) (see below). This is fine because p(s | x) is a one-to-one function of D(x), and therefore both have the same discrimination power.
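The definition on the slide is an image; one common convention, consistent with p(s | x) being a one-to-one function of D(x), is

\[
D(x) = \frac{p(x \mid s)}{p(x \mid s) + p(x \mid b)} ,
\qquad
p(s \mid x) = \frac{D(x)\, p(s)}{D(x)\, p(s) + \big[ 1 - D(x) \big]\, p(b)} ,
\]
that is, D(x) is p(s | x) computed with equal priors. (The likelihood ratio p(x | s) / p(x | b) is an equally valid, monotonically related, choice.)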

Classification In Practice

Classification: In Practice. Here is a short list of multivariate (MVA) methods that can be used for classification: Random Grid Search, Fisher Discriminant, Quadratic Discriminant, Naïve Bayes (Likelihood Discriminant), Kernel Density Estimation, Support Vector Machines, Binary Decision Trees, Neural Networks, Bayesian Neural Networks, RuleFit, Random Forests.

Illustrative Example

Example – H → ZZ → 4 leptons. We shall use this example to illustrate a few of the methods; the slide shows the signal and background distributions. We start with p(s) / p(b) ~ 1 / 20 and use x = (m_Z1, m_Z2).

A 4-Lepton Event from CMS

Random Grid Search

Random Grid Search (RGS). Take each point of the signal class as a cut-point. For each cut-point, N_tot = number of events before cuts, N_cut = number of events after cuts, and Fraction = N_cut / N_tot, computed separately for signal and background. (H.B.P. et al., Proceedings, CHEP 1995.)
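A minimal sketch of the idea, assuming one-sided rectangular cuts of the form x1 > c1 and x2 > c2 and NumPy arrays sig and bkg of shape (N, 2); the names and the figure of merit are illustrative, not taken from the original code:

```python
import numpy as np

def random_grid_search(sig, bkg):
    """Use every signal point as a candidate cut-point and record the
    fraction of signal and background events passing x1 > c1 and x2 > c2."""
    results = []
    for c1, c2 in sig:
        s_frac = np.mean((sig[:, 0] > c1) & (sig[:, 1] > c2))
        b_frac = np.mean((bkg[:, 0] > c1) & (bkg[:, 1] > c2))
        results.append((c1, c2, s_frac, b_frac))
    return np.array(results)

# Toy usage: scan the cuts and pick one by a crude figure of merit
rng = np.random.default_rng(0)
sig = rng.normal(loc=+1.0, size=(500, 2))
bkg = rng.normal(loc=-1.0, size=(5000, 2))
res = random_grid_search(sig, bkg)
best = res[np.argmax(res[:, 2] - res[:, 3])]
print("best cut (c1, c2, signal fraction, background fraction):", best)
```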

Example – H → ZZ → 4 leptons. The red point gives p(s | x) / p(b | x) ~ 1 / 1.

Linear & Quadratic Discriminants

Fisher (Linear) Discriminant. Taking p(x | s) and p(x | b) to be Gaussian with a common covariance matrix, and dropping the constant term, yields a linear discriminant whose fixed values define hyper-plane decision boundaries (see the sketch below).
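The formula on the slide is an image; the standard Fisher discriminant obtained from the Gaussian log-likelihood ratio, with common covariance matrix Σ and class means μ_s and μ_b, is

\[
\lambda(x) = w \cdot x , \qquad w = \Sigma^{-1} \big( \mu_s - \mu_b \big) ,
\]
and the decision boundary is the hyper-plane λ(x) = constant.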

Quadratic Discriminant. If we use different covariance matrices for the signal and background densities, we obtain the quadratic discriminant (see below): a fixed value of the discriminant defines a curved decision boundary that partitions the space {x} into signal-rich and background-rich regions.
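The formula is again an image; the standard quadratic discriminant from the Gaussian log-likelihood ratio, with class covariance matrices Σ_s and Σ_b, is

\[
\lambda(x) = -\tfrac{1}{2} (x - \mu_s)^{T} \Sigma_s^{-1} (x - \mu_s)
             + \tfrac{1}{2} (x - \mu_b)^{T} \Sigma_b^{-1} (x - \mu_b)
             - \tfrac{1}{2} \ln \frac{|\Sigma_s|}{|\Sigma_b|} .
\]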

Naïve Bayes (Likelihood Ratio Discriminant)

Naïve Bayes. In this method the d-dimensional density p(x) is replaced by the product approximation p(x) ≈ q(x_1) q(x_2) ⋯ q(x_d), where the q(x_i) are the 1-D marginal densities of p(x).

Naïve Bayes. The naïve Bayes estimate of D(x) is obtained by inserting the product approximations of p(x | s) and p(x | b) into D(x) (see the sketch below). In spite of its name, this method can sometimes provide very good results; and, of course, if the variables truly are statistically independent, then this approximation is expected to be good.
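The formula on the slide is an image; with the D(x) convention assumed earlier, and with q_s and q_b the 1-D marginals of p(x | s) and p(x | b), the naïve Bayes estimate reads

\[
D_{\mathrm{NB}}(x) = \frac{\prod_{i=1}^{d} q_s(x_i)}
                          {\prod_{i=1}^{d} q_s(x_i) + \prod_{i=1}^{d} q_b(x_i)} .
\]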

Kernel Density Estimation

Kernel Density Estimation. Basic idea: place a kernel function at each point and adjust their widths to obtain the best approximation. (Parzen estimation, 1960s; mixtures.)
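A minimal sketch of Parzen estimation with an isotropic Gaussian kernel, assuming NumPy; the function and variable names are illustrative, not the code used in the lectures:

```python
import numpy as np

def gaussian_kde(points, h):
    """Kernel density estimate from `points` (shape (N, d)) with a
    Gaussian kernel of width h placed at every point."""
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    norm = n * (np.sqrt(2.0 * np.pi) * h) ** d
    def density(x):
        diff = points - np.asarray(x, dtype=float)
        r2 = np.sum(diff ** 2, axis=1) / h ** 2
        return np.sum(np.exp(-0.5 * r2)) / norm
    return density

# Toy usage: estimate a 2-D standard normal density and evaluate it at the origin
rng = np.random.default_rng(1)
sample = rng.normal(size=(1000, 2))
p_hat = gaussian_kde(sample, h=0.3)
print(p_hat([0.0, 0.0]))   # roughly 1/(2*pi) ≈ 0.16, smoothed a little by the kernel width
```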

Kernel Density Estimation. Why does it work? In the limit N → ∞ the estimate, p̂(x) = (1/N) Σ_i K(x – x_i), recovers the true density p(x), provided that the kernel K converges to a d-dimensional δ-function (i.e., its width shrinks at a suitable rate as N grows).

KDE of Signal and Background (panels comparing the training points with their KDE approximations).

PART II

Support Vector Machines

Support Vector Machines. This is a generalization of the Fisher discriminant (Boser, Guyon and Vapnik, 1992). Basic idea: data that are non-separable in d dimensions may be better separated if mapped into a space of higher (usually infinite) dimension. As in the Fisher discriminant, a hyper-plane is used to partition the high-dimensional space.

Support Vector Machines. Consider separable data in the high-dimensional space: green plane, w·h(x) + c = 0; red plane, w·h(x_1) + c = +1; blue plane, w·h(x_2) + c = –1. Subtracting blue from red gives w·[h(x_1) – h(x_2)] = 2, and normalizing the high-dimensional vector w gives ŵ·[h(x_1) – h(x_2)] = 2/||w||.

Support Vector Machines. The quantity m = ŵ·[h(x_1) – h(x_2)], the distance between the red and blue planes, is called the margin. The best separation occurs when the margin is as large as possible. Note: because m ~ 1/||w||, maximizing the margin is equivalent to minimizing ||w||².

Support Vector Machines. Label the red dots y = +1 and the blue dots y = –1. The task is to minimize ||w||² subject to the constraints y_i (w·h(x_i) + c) ≥ 1, i = 1 … N; that is, the task is to minimize the Lagrangian L(w, c, α) sketched below, where the α_i are Lagrange multipliers.
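The Lagrangian on the slide is an image; its standard form, consistent with the constraints above, is

\[
L(w, c, \alpha) = \tfrac{1}{2} \lVert w \rVert^{2}
  - \sum_{i=1}^{N} \alpha_i \Big[ y_i \big( w \cdot h(x_i) + c \big) - 1 \Big] ,
\qquad \alpha_i \ge 0 .
\]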

Support Vector Machines. When L(w, c, α) is minimized with respect to w and c, it can be transformed to the dual function E(α) sketched below. At the minimum of E(α), the only non-zero coefficients α_i are those corresponding to points on the red and blue planes: that is, the so-called support vectors. The key idea is to replace the scalar product h(x_i)·h(x_j) between two vectors of infinitely many dimensions by a kernel function K(x_i, x_j). The (unsolved) problem is how to choose the correct kernel for a given problem.
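The dual on the slide is an image; a standard form, written here so that it is minimized, as in the text, is

\[
E(\alpha) = \tfrac{1}{2} \sum_{i,j} \alpha_i\, \alpha_j\, y_i\, y_j\, K(x_i, x_j)
            - \sum_{i} \alpha_i ,
\qquad \alpha_i \ge 0 , \quad \sum_i \alpha_i\, y_i = 0 ,
\]
where K(x_i, x_j) replaces h(x_i)·h(x_j).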

Neural Networks

Neural Networks. Given training data T = (x, y), with x = {x_1, …, x_N} and y = {y_1, …, y_N}, of N training examples, find a function f(x, w), with parameters w, using maximum likelihood.

Neural Networks. A typical likelihood for classification is p(y | x, w) = Π_i n(x_i, w)^{y_i} [1 – n(x_i, w)]^{1–y_i}, where y = 0 for background events (e.g., junk e-mail) and y = 1 for signal events (e.g., "good" e-mail). As noted in the previous lecture, IF n(x, w) is sufficiently flexible, then the maximum likelihood solution for w gives n(x, w) = p(s | x) asymptotically, that is, as we acquire more and more data.

Historical Aside – Hilbert's 13th Problem. Problem 13: prove the conjecture that, in general, it is impossible to write f(x_1, …, x_n) = F( g_1(x_1), …, g_n(x_n) ). But, in 1957, Kolmogorov disproved Hilbert's conjecture! Today, we know that functions of the form used in feed-forward neural networks (sums of sigmoidal functions of linear combinations of the inputs; see the sketch after the next slide) can provide arbitrarily accurate approximations (Hornik, Stinchcombe, and White, Neural Networks 2 (1989)).

NN – Graphical Representation. (The slide shows a network with inputs x_1, x_2, input-to-hidden weights a_{ji}, hidden biases b_j, hidden-to-output weights c_j and output bias d.) The parameters are w = (a, b, c, d); f is used for regression, n is used for classification.
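A minimal sketch of the function this diagram represents, assuming a single hidden layer with tanh activations and a logistic output; the weights below are random placeholders, not trained values:

```python
import numpy as np

def nn_output(x, a, b, c, d):
    """n(x, w) = sigmoid(d + sum_j c_j * tanh(b_j + sum_i a_ji * x_i));
    dropping the output sigmoid gives f(x, w), the regression form."""
    hidden = np.tanh(b + a @ x)          # hidden-node outputs
    f = d + c @ hidden                   # linear output, used for regression
    return 1.0 / (1.0 + np.exp(-f))      # logistic output, used for classification

# Toy usage with 2 inputs and 5 hidden nodes
rng = np.random.default_rng(2)
a = rng.normal(size=(5, 2))   # input-to-hidden weights a_ji
b = rng.normal(size=5)        # hidden biases b_j
c = rng.normal(size=5)        # hidden-to-output weights c_j
d = 0.0                       # output bias
print(nn_output(np.array([1.0, -0.5]), a, b, c, d))
```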

Example – H → ZZ → 4 leptons (signal and background).

Decision Trees

Decision Trees. A decision tree is a sequence of if-then-else statements, that is, cuts. Basic idea: choose cuts that partition the space {x} into regions of increasing purity, and do so recursively, starting from a root node and splitting child nodes until leaf nodes are reached. (Example: MiniBooNE, Byron Roe.)

Decision Trees. Geometrically, a decision tree is simply a d-dimensional histogram whose bins are constructed recursively. To each bin we associate the value of the function f(x) to be approximated; this provides a piecewise-constant approximation of f(x). (The MiniBooNE example, Byron Roe, shows bins in PMT hits versus energy (GeV), each with its signal and background counts and f(x) = 0 or 1.)

Decision Trees. For each variable, find the best cut, defined as the one that yields the biggest decrease in impurity: Δ = Impurity(parent bin) – Impurity("left" bin) – Impurity("right" bin). Then choose the best cut among these cuts, partition the space, and repeat with each child bin.

Decision Trees. The most common impurity measure is the Gini index (Corrado Gini, 1912): Impurity = p(1 – p), where p = S / (S + B); p = 0 or 1 corresponds to maximal purity, and p = 0.5 to maximal impurity.
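A minimal sketch of one splitting step, assuming weighted events with labels y (1 = signal, 0 = background) and using the event-weighted form of the impurity decrease (a common convention; the slide writes the decrease without explicit weights); the names are illustrative:

```python
import numpy as np

def gini(y, w):
    """Impurity = p(1 - p) with p = S / (S + B), using event weights w."""
    tot = np.sum(w)
    if tot == 0.0:
        return 0.0
    p = np.sum(w[y == 1]) / tot
    return p * (1.0 - p)

def best_cut(x, y, w):
    """Scan candidate cuts on one variable and return the cut with the
    largest decrease in (weighted) impurity relative to the parent bin."""
    parent = gini(y, w) * np.sum(w)
    best = (None, -np.inf)
    for c in np.unique(x):
        left, right = x <= c, x > c
        decrease = parent - gini(y[left], w[left]) * np.sum(w[left]) \
                          - gini(y[right], w[right]) * np.sum(w[right])
        if decrease > best[1]:
            best = (c, decrease)
    return best

# Toy usage on one variable
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(+1, 1, 200), rng.normal(-1, 1, 200)])
y = np.concatenate([np.ones(200, int), np.zeros(200, int)])
w = np.ones_like(x)
print(best_cut(x, y, w))
```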

Ensemble Methods

Ensemble Methods. Suppose that you have a collection of discriminants f(x, w_k) which, individually, perform only marginally better than random guessing. It is possible to build highly effective discriminants from such weak learners by averaging over them, for example with a weighted average f(x) = ∑_k a_k f(x, w_k) (Jerome Friedman & Bogdan Popescu, 2008).

Ensemble Methods. The most popular methods (used mostly with decision trees) are: Bagging, each tree trained on a bootstrap sample drawn from the training set; Random Forest, bagging with randomized trees; Boosting, each tree trained on a different weighting of the full training set.

Adaptive Boosting. Repeat K times: (1) create a decision tree f(x, w); (2) compute its error rate ε on the weighted training set; (3) compute α = ln[(1 – ε)/ε]; (4) modify the training set: increase the weights of incorrectly classified examples relative to the weights of those that are correctly classified. Then compute the weighted average f(x) = ∑ α_k f(x, w_k). (Y. Freund and R.E. Schapire, Journal of Computer and Sys. Sci. 55 (1), 119 (1997).)
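A minimal sketch of this loop, using scikit-learn decision stumps as the weak learners; the choice of stump, the stopping condition, and the normalization of the final average are illustrative assumptions, not details taken from the lectures:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, K=50):
    """AdaBoost as described above, with labels y in {0, 1}; returns (alphas, trees)."""
    n = len(y)
    w = np.ones(n) / n
    alphas, trees = [], []
    for _ in range(K):
        tree = DecisionTreeClassifier(max_depth=1)
        tree.fit(X, y, sample_weight=w)
        miss = tree.predict(X) != y
        eps = np.sum(w[miss]) / np.sum(w)       # weighted error rate
        if eps <= 0 or eps >= 0.5:               # stop if no longer a weak learner
            break
        alpha = np.log((1 - eps) / eps)
        w[miss] *= np.exp(alpha)                 # up-weight misclassified events
        w /= np.sum(w)
        alphas.append(alpha)
        trees.append(tree)
    return np.array(alphas), trees

def boosted_discriminant(X, alphas, trees):
    """Weighted average f(x) = sum_k alpha_k f(x, w_k), normalized to [0, 1]."""
    votes = np.array([t.predict(X) for t in trees])   # shape (K, N) of 0/1 votes
    return (alphas @ votes) / np.sum(alphas)
```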

First 6 Decision Trees

First 100 Decision Trees

Averaging over a Forest

Error Rate vs Number of Trees

Example – H → ZZ → 4 leptons (signal and background), using 200 trees with a minimum of 100 counts per bin (leaf).

Bayesian Neural Networks

Bayesian Neural Networks. Given the posterior density p(w | T) = p(T | w) p(w) / p(T) over the parameter space of the functions n(x, w) = 1 / [1 + exp(–f(x, w))], one can estimate p(s | x) as follows: p(s | x) ≈ n(x) = ∫ n(x, w) p(w | T) dw. The function n(x) is called a Bayesian Neural Network (BNN).

Bayesian Neural Networks. Sample N points {w} from p(w | T) using a Markov chain Monte Carlo (MCMC) technique (see the talk by Glen Cowan) and average over the last M points: n(x) = ∫ n(x, w) p(w | T) dw ≈ ∑_i n(x, w_i) / M.
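A minimal sketch of the averaging step, re-using the illustrative nn_output function defined earlier and assuming the posterior samples are already available as a list of parameter tuples (the MCMC sampling itself is not shown):

```python
import numpy as np

def bnn_output(x, posterior_samples, nn):
    """Approximate n(x) = ∫ n(x, w) p(w | T) dw by averaging the network
    output over the last M posterior samples w_i = (a, b, c, d)."""
    return np.mean([nn(x, *w) for w in posterior_samples])

# Usage sketch:
#   posterior_samples = [(a_1, b_1, c_1, d_1), ..., (a_M, b_M, c_M, d_M)]
#   p_s_given_x = bnn_output(x, posterior_samples, nn_output)
```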

Example – H → ZZ → 4 leptons. Dots: p(s | x) = H_s / (H_s + H_b), where H_s and H_b are 1-D histograms. Curves: individual NNs n(x, w_k). Black curve: the BNN average n(x) = ∫ n(x, w) p(w | T) dw.

Modeling Response Functions. Consider the observed jet p_T spectrum. In order to avoid unfolding, we need to publish the response function and the prior. The latter can be supplied as a sample of points (ν, ω) over which one would average. But what of the response function?

Modeling Response Functions. The response function can be written in a form that, in principle, could be modeled as follows: (1) build a discriminant D between the original events and events in which the jet p_T is replaced by a value sampled from a known distribution u(p_T); (2) approximate R using D (see the note below).
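The formulas on the slide are images and are not reproduced here. The general identity behind step 2, for a discriminant D trained to separate the two samples, is the density-ratio trick; how exactly it is combined with u(p_T) to recover R is left to the original slides:

\[
D(x) \approx \frac{p_{\mathrm{orig}}(x)}{p_{\mathrm{orig}}(x) + p_{u}(x)}
\quad \Longrightarrow \quad
\frac{p_{\mathrm{orig}}(x)}{p_{u}(x)} \approx \frac{D(x)}{1 - D(x)} .
\]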

Summary. Multivariate methods can be applied to many aspects of data analysis. Many practical methods, and convenient tools such as TMVA, are available for regression and classification. All methods approximate the same mathematical entities, but no one method is guaranteed to be the best in all circumstances. So, just as is true of statistical methods, it is good practice to experiment with a few of them!

MVA Tutorials
1. Download the (25M!) file tutorials.tar.gz from
2. Unpack: tar zxvf tutorials.tar.gz
3. Check: cd tutorials-cowan; python expFit.py