Multivariate Analysis Techniques


Multivariate Analysis Techniques
Helge Voss (MPI-K, Heidelberg)
Terascale Statistics Tool School, DESY, 29.9.-2.10.2008

Outline
Introduction:
 - the classification problem
 - what is Multivariate Data Analysis and Machine Learning
Classifiers:
 - Bayes Optimal Analysis
 - Kernel Methods and Likelihood Estimators
 - Linear Fisher Discriminant
 - Neural Networks
 - Support Vector Machines
 - Boosted Decision Trees
Will not talk about:
 - Unsupervised learning
 - Regression

Literature / Software Packages (just a short and biased selection)
Literature:
 - T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning", Springer 2001
 - C. M. Bishop, "Pattern Recognition and Machine Learning", Springer 2006
Software packages for Multivariate Data Analysis / Classification:
 - individual classifier software, e.g. "JETNET": C. Peterson, T. Rognvaldsson, L. Loennblad
 - attempts to provide "all inclusive" packages:
   - StatPatternRecognition: I. Narsky, arXiv: physics/0507143, http://www.hep.caltech.edu/~narsky/spr.html
   - TMVA: Höcker, Speckmayer, Stelzer, Tegenfelt, Voss, Voss, arXiv: physics/0703039, http://tmva.sf.net or every ROOT distribution (not necessarily the latest TMVA version though)
   - WEKA: http://www.cs.waikato.ac.nz/ml/weka/
Conferences: PHYSTAT, ACAT, ...

Event Classification in High Energy Physics (HEP)
Most HEP analyses require discrimination of signal from background:
 - Event level (Higgs searches, ...)
 - Cone level (tau vs. quark-jet reconstruction, ...)
 - Track level (particle identification, ...)
 - Lifetime and flavour tagging (b-tagging, ...)
 - etc.
Input information comes from multiple variables from various sources:
 - Kinematic variables (masses, momenta, decay angles, ...)
 - Event properties (jet/lepton multiplicity, sum of charges, ...)
 - Event shape (sphericity, Fox-Wolfram moments, ...)
 - Detector response (silicon hits, dE/dx, Cherenkov angle, shower profiles, muon hits, ...)
 - etc.
Traditionally only a few powerful input variables were combined; new methods allow the use of 100 and more variables without loss of classification power.

Event Classification
Suppose a data sample with two types of events: hypotheses H0, H1, or class labels Background and Signal.
(We restrict ourselves here to the two-class case. Many classifiers can in principle be extended to several classes; otherwise, analyses can be staged.)
We have discriminating variables x1, x2, ...
How do we set the decision boundary to select events of type H1? Rectangular cuts? A linear boundary? A nonlinear one?
[Figure: three sketches in the (x1, x2) plane separating H0 from H1 with rectangular cuts, a linear boundary and a nonlinear boundary.]
How can we decide this in an "optimal" way?

Event Classification
Find a mapping from the D-dimensional input space ("feature space") to a one-dimensional output, i.e. the class labels:
 - most general form: y = y(x), with x = {x1, ..., xD} the input variables
 - y(H0) -> 0, y(H1) -> 1
 - y(x): R^D -> R
y(x) is a "test statistic" in the D-dimensional space of input variables.
y(x) = const is the surface defining the decision boundary.
Distributions of y(x): yS(y) and yB(y)
 - y > cut: signal
 - y = cut: decision boundary
 - y < cut: background
y(x) is used to set cuts that define the efficiency and purity of the selection; the overlap of yS(y) and yB(y) determines separation power and purity.

Event Classification
yS(y): normalised distribution of y(x) for signal events.
yB(y): normalised distribution of y(x) for background events.
With y = y(x) one can equivalently speak of yS(x) and yB(x), the normalised distributions of y(x) for signal and background events.
With fS and fB being the fractions of signal and background events in the sample, one obtains the probability that an event with measured x = {x1, ..., xD} is of type signal.
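In terms of the quantities defined on this slide, this probability takes the standard mixture form:

$$ P(S \mid y(x)) \;=\; \frac{f_S\, y_S\big(y(x)\big)}{f_S\, y_S\big(y(x)\big) + f_B\, y_B\big(y(x)\big)} $$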

Event Classification
P(Class = C | x) (or simply P(C | x)): the probability that the event is of class C, given the measured observables x = {x1, ..., xD} -- the posterior probability.
p(x | C): the probability density distribution of the measurements for a given class of events, i.e. the differential cross section.
P(C): the prior probability to observe an event of class C, i.e. the relative abundance of "signal" versus "background".
p(x): the overall probability density to observe the actual measurement x.
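These four ingredients are related by Bayes' theorem:

$$ P(C \mid x) \;=\; \frac{p(x \mid C)\, P(C)}{p(x)}, \qquad p(x) = \sum_{C'} p(x \mid C')\, P(C') $$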

Bayes Optimal Classification
Minimum misclassification error is obtained if C is chosen such that it has maximum P(C | x), i.e. to select S(ignal) over B(ackground), place the decision on P(S | x) / P(B | x), or any monotonic function of it.
This quantity is the prior odds of a signal event (relative probability of signal vs. background) times the likelihood ratio, which serves as discriminating function y(x).
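Written out, the decision quantity factorises into the two pieces named on the slide:

$$ \frac{P(S \mid x)}{P(B \mid x)} \;=\; \underbrace{\frac{P(S)}{P(B)}}_{\text{prior odds}} \cdot \underbrace{\frac{p(x \mid S)}{p(x \mid B)}}_{\text{likelihood ratio } y(x)} $$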

Neyman-Pearson Lemma
The likelihood ratio used as "selection criterion" y(x) gives, for each selection efficiency, the best possible background rejection, i.e. it maximises the area under the ROC ("Receiver Operating Characteristics") curve.
[Figure: ROC curve, 1 - eps_background vs. eps_signal; the "limit" in the ROC curve is given by the likelihood ratio, the diagonal corresponds to random guessing, curves bending towards the corner mean better classification.]
y(x) is the discriminating function given by your estimator (i.e. the likelihood ratio); varying the cut in y(x) > cut moves the working point (efficiency and purity) along the ROC curve.
Where to choose your working point? One needs to know the prior probabilities (abundances):
 - measurement of a signal cross section: maximum of S/sqrt(S+B)
 - discovery of a signal: maximum of S/sqrt(B)
 - precision measurement: high purity
 - trigger selection: high efficiency
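A minimal numpy sketch of such a ROC scan on toy one-dimensional Gaussian data (my own toy example, not from the slides), where the likelihood ratio is known analytically:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy one-dimensional problem: signal and background are unit-width Gaussians.
x_sig = rng.normal(loc=+1.0, scale=1.0, size=10000)
x_bkg = rng.normal(loc=-1.0, scale=1.0, size=10000)

def likelihood_ratio(x):
    """y(x) = p(x|S) / p(x|B) for the two known toy densities."""
    p_s = np.exp(-0.5 * (x - 1.0) ** 2)
    p_b = np.exp(-0.5 * (x + 1.0) ** 2)
    return p_s / p_b

y_sig = likelihood_ratio(x_sig)
y_bkg = likelihood_ratio(x_bkg)

# Scan the cut value y(x) > cut to trace out the ROC curve.
cuts = np.quantile(np.concatenate([y_sig, y_bkg]), np.linspace(0.0, 1.0, 200))
eff_sig = np.array([(y_sig > c).mean() for c in cuts])        # signal efficiency
rej_bkg = np.array([1.0 - (y_bkg > c).mean() for c in cuts])  # 1 - background efficiency

# Area under the ROC curve (larger = better separation).
order = np.argsort(eff_sig)
auc = np.trapz(rej_bkg[order], eff_sig[order])
print(f"AUC of the likelihood-ratio classifier: {auc:.3f}")
```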

Efficiency or Purity?
Any decision involves a certain "risk" of misclassification.
 - Type-1 error: classify an event as class C even though it is not (reject the null hypothesis although it is true) -> loss of purity.
 - Type-2 error: fail to identify an event from class C as such (fail to reject a false null hypothesis) -> loss of efficiency.
[Figure: ROC curves of two different discriminating functions y(x); the "limit" is given by the likelihood ratio. Which one of the two is the better one?]
If the discriminating function y(x) is the true likelihood ratio, the optimal working point for a specific analysis lies somewhere on its ROC curve.
If y(x) is not the true likelihood ratio, one classifier might be better than another at one working point and worse at a different one.

Event Classification
Unfortunately, the true probability density functions are typically unknown, so the Neyman-Pearson lemma doesn't really help us directly.
In HEP we use Monte Carlo simulation, or in general a set of known (already classified) "events".
Use these "training" events to:
 - try to estimate the functional form of p(x|C) (e.g. the differential cross section folded with the detector effects), from which the likelihood ratio can be obtained; e.g. D-dimensional histograms, kernel density estimators, ...
 - find a "discrimination function" y(x) and corresponding decision boundary (i.e. a hyperplane* in the "feature space": y(x) = const) that optimally separates signal from background; e.g. linear discriminants, neural networks, ...
This is supervised (machine) learning.
* A hyperplane in the strict sense goes through the origin; "affine set" would be the precise term.

Nearest Neighbour and Kernel Density Estimator
Trying to estimate the probability density p(x) in D-dimensional space from "events" distributed according to p(x):
In a volume V around a point x one expects to find N * integral_V p(x) dx events from a dataset with N events; given the dataset, one counts the K events found inside V (kernel function: Parzen window).
Keep the number K fixed and vary the volume size V until it contains K events; the size of V is then a measure of p(x) at x.
kNN (k-nearest neighbours): classify according to the relative number of events of the various classes amongst the k nearest neighbours.

Nearest Neighbour and Kernel Density Estimator
Alternatively, keep the volume V constant and let K vary: counting the K events inside the fixed window (kernel function: Parzen window) around x gives the kernel density estimate of the probability density, p(x) ~ K / (N V).
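A minimal numpy sketch of both estimators (rectangular Parzen window with fixed volume, and kNN with fixed K) on a toy two-dimensional sample; the function names and the toy data are illustrative only:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))   # toy 2-D "training events" distributed according to p(x)

def parzen_density(x, data, h):
    """Fixed volume, varying K: fraction of events inside a hypercube of edge h around x."""
    n, d = data.shape
    inside = np.all(np.abs(data - x) < h / 2.0, axis=1)   # rectangular Parzen window
    return inside.sum() / (n * h ** d)                    # p(x) ~ K / (N * V)

def knn_density(x, data, k):
    """Fixed K, varying volume: grow a ball around x until it holds k events."""
    n, d = data.shape
    r = np.sort(np.linalg.norm(data - x, axis=1))[k - 1]  # distance to the k-th neighbour
    volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d
    return k / (n * volume)

print(parzen_density(np.zeros(2), data, 0.5), knn_density(np.zeros(2), data, 20))
```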

Kernel Density Estimator
The Parzen window is a "rectangular kernel", which produces discontinuities at the window edges.
A smoother model for p(x) is obtained with smooth kernel functions, e.g. a Gaussian: place a Gaussian around each training data point and sum up their contributions at arbitrary points x to obtain the probability density estimate p(x).
[Figure: individual kernels and the averaged kernels forming the density estimate.]
h: "size" of the kernel, the "smoothing parameter". There is a large variety of possible kernel functions.

Kernel Density Estimator
A general probability density estimator using kernel K; h is the "size" of the kernel, the "smoothing parameter".
The chosen size of the smoothing parameter is more important than the kernel function itself:
 - h too small: overtraining
 - h too large: not sensitive to features in p(x)
For Gaussian kernels the optimum in terms of MISE (mean integrated squared error) is given by h_xi = (4/(3N))^(1/5) * sigma_xi, with sigma_xi the RMS in variable xi (C. M. Bishop).
A drawback of kernel density estimators: the evaluation for any test event involves ALL training data, which is typically very time consuming; binary search trees (i.e. kd-trees) are typically used in kNN methods to speed up the search.
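A sketch of a one-dimensional Gaussian kernel density estimator using the rule-of-thumb bandwidth quoted above (toy data and function names are illustrative):

```python
import numpy as np

def gaussian_kde(x, data, h=None):
    """Gaussian kernel density estimate at point(s) x from 1-D training data.

    If no bandwidth is given, use the rule of thumb quoted on the slide,
    h = (4 / (3N))**(1/5) * sigma  (optimal MISE for Gaussian kernels).
    """
    data = np.asarray(data, dtype=float)
    n = data.size
    if h is None:
        h = (4.0 / (3.0 * n)) ** 0.2 * data.std()
    x = np.atleast_1d(x).astype(float)
    # Place a Gaussian of width h around every training point and average.
    z = (x[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(1)
sample = rng.normal(size=500)
print(gaussian_kde([0.0, 1.0, 2.0], sample))   # should be close to the standard normal pdf
```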

"Curse of Dimensionality"
Bellman, R. (1961), Adaptive Control Processes: A Guided Tour, Princeton University Press.
We all know: filling a D-dimensional histogram to get a mapping of the PDF is typically unfeasible due to lack of Monte Carlo events.
Shortcoming of nearest-neighbour strategies: in higher-dimensional classification/regression cases, the idea of looking at "training events" in a reasonably small "vicinity" of the space point to be classified becomes difficult.
Consider a total phase-space volume V = 1^D; a cube capturing a particular fraction of the volume needs an edge length equal to that fraction to the power 1/D.
In 10 dimensions, in order to capture 1% of the phase space, 63% of the range in each variable is necessary.
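The quoted 63% is just the edge length needed to enclose a fraction f = 0.01 of the unit volume in D = 10 dimensions:

$$ \ell = f^{1/D} = 0.01^{1/10} \approx 0.63 $$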

Naïve Bayesian Classifier (often called "(projective) Likelihood")
While kernel methods or nearest-neighbour classifiers try to estimate the full D-dimensional joint probability distribution, the naïve Bayesian classifier assumes that correlations between variables are weak: for each class (signal, background) the PDFs of the individual discriminating variables are multiplied to form the likelihood ratio for event k.
One of the first and still very popular MVA algorithms in HEP.
Rather than making hard cuts on individual variables, it allows for some "fuzziness": one very signal-like variable may counterbalance another, less signal-like variable (PDE introduces fuzzy logic).
This is about the optimal method in the absence of correlations.
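A sketch of such a projective likelihood built from per-variable histograms (the histogram binning and the small floor on empty bins are my own choices, not from the slides):

```python
import numpy as np

def projective_likelihood(train_sig, train_bkg, x, nbins=40):
    """Naive Bayesian ("projective likelihood") ratio for test events x.

    train_sig, train_bkg: arrays of shape (N_events, D) of training events.
    Each variable's 1-D PDF is estimated by a histogram; correlations are ignored.
    """
    lo = np.minimum(train_sig.min(axis=0), train_bkg.min(axis=0))
    hi = np.maximum(train_sig.max(axis=0), train_bkg.max(axis=0))
    l_sig = np.ones(len(x))
    l_bkg = np.ones(len(x))
    for i in range(x.shape[1]):
        edges = np.linspace(lo[i], hi[i], nbins + 1)
        p_s, _ = np.histogram(train_sig[:, i], bins=edges, density=True)
        p_b, _ = np.histogram(train_bkg[:, i], bins=edges, density=True)
        idx = np.clip(np.digitize(x[:, i], edges) - 1, 0, nbins - 1)
        l_sig *= p_s[idx] + 1e-12   # small floor avoids zero-probability bins
        l_bkg *= p_b[idx] + 1e-12
    return l_sig / (l_sig + l_bkg)  # y in [0, 1]: close to 1 = signal-like
```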

Naïve Bayesian Classifier (often called "(projective) Likelihood")
How to parameterise the 1-dim PDFs?
 - event counting (histogramming): automatic and unbiased, but suboptimal
 - parametric (function) fitting: difficult to automate for arbitrary PDFs
 - nonparametric fitting (i.e. splines, kernels): easy to automate, but can create artefacts / suppress information
[Figure: example parameterisations where the original distribution is Gaussian.]

What if there are correlations?
Typically correlations are present: Cij = cov[xi, xj] = E[xi xj] - E[xi] E[xj] != 0 (i != j).
Pre-processing: choose a set of linearly transformed input variables for which Cij = 0 (i != j).

Decorrelation
Determine the square root C' of the covariance matrix C, i.e. C = C'C'.
Compute C' by diagonalising C, and transform from the original variable space (x) into the de-correlated variable space (x') by x' = C'^(-1) x.
In general the decorrelation for signal and background variables is different:
 - for the likelihood ratio, decorrelate signal and background independently
 - for other estimators, one may need to decide on one of the two...
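A numpy sketch of this square-root decorrelation, assuming the covariance matrix is taken from the training sample:

```python
import numpy as np

def decorrelate(x):
    """Transform events x (shape: N_events x D) into decorrelated variables x' = C'^(-1) x.

    C' is the symmetric square root of the covariance matrix, obtained by diagonalisation.
    """
    x = np.asarray(x, dtype=float)
    cov = np.cov(x, rowvar=False)                              # C
    eigval, eigvec = np.linalg.eigh(cov)                       # C = S D S^T
    sqrt_cov = eigvec @ np.diag(np.sqrt(eigval)) @ eigvec.T    # C' = S sqrt(D) S^T
    xc = x - x.mean(axis=0)
    return xc @ np.linalg.inv(sqrt_cov).T                      # each event mapped by C'^(-1)

# Quick check: the transformed variables should have a unit covariance matrix.
rng = np.random.default_rng(2)
a = rng.normal(size=(5000, 1))
data = np.hstack([a, a + 0.3 * rng.normal(size=(5000, 1))])    # strongly correlated pair
print(np.cov(decorrelate(data), rowvar=False).round(2))
```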

Limitations of the Decorrelation
In cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care.
How does linear decorrelation affect cases where the correlations of signal and background differ?
[Figure: signal and background scatter plots, original correlations vs. SQRT decorrelation.]

Limitations of the Decorrelation
In cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care.
How does linear decorrelation affect strongly nonlinear cases?
[Figure: signal and background scatter plots, original correlations vs. SQRT decorrelation.]

Linear Discriminant
Non-parametric methods like k-nearest-neighbour or kernel density estimators suffer from
 - lack of training data ("curse of dimensionality")
 - slow response time (need to evaluate the whole training data for every test event)
=> use parametric models y(x) fitted to the training data.
Linear discriminant, i.e. any linear function of the input variables, giving rise to linear decision boundaries.
[Figure: linear decision boundary separating H0 and H1 in the (x1, x2) plane.]
How do we determine the "weights" w that do "best"?
Watch out: these lines are NOT the linear function y(x) itself, but rather the decision boundary given by y(x) = const.

Fisher's Linear Discriminant
How do we determine the "weights" w that do "best"?
Maximise the "separation" between the two classes S and B, i.e. minimise the overlap of the distributions yS(x) and yB(x):
 - maximise the distance between the two mean values of the classes
 - minimise the variance within each class
=> maximise the Fisher coefficients.
Note: these quantities can be calculated from the training data.
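A numpy sketch of the standard Fisher solution, w ~ W^(-1)(mean_S - mean_B) with W the sum of the within-class covariance matrices, computed from toy training data:

```python
import numpy as np

def fisher_discriminant(sig, bkg):
    """Fisher's linear discriminant from training samples (shape: N_events x D).

    Returns the weight vector w ~ W^(-1) (mean_S - mean_B), where W is the
    sum of the within-class covariance matrices.
    """
    mean_s, mean_b = sig.mean(axis=0), bkg.mean(axis=0)
    within = np.cov(sig, rowvar=False) + np.cov(bkg, rowvar=False)
    return np.linalg.solve(within, mean_s - mean_b)

rng = np.random.default_rng(3)
sig = rng.multivariate_normal([1.0, 1.0], [[1.0, 0.5], [0.5, 1.0]], size=5000)
bkg = rng.multivariate_normal([-1.0, -1.0], [[1.0, 0.5], [0.5, 1.0]], size=5000)
w = fisher_discriminant(sig, bkg)
y_sig, y_bkg = sig @ w, bkg @ w     # project events onto the discriminant axis
print("mean y(x): signal %.2f, background %.2f" % (y_sig.mean(), y_bkg.mean()))
```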

Linear Discriminant and Nonlinear Correlations
Assume the following nonlinearly correlated data: the linear discriminant obviously doesn't do a very good job here.
Of course, these data can easily be de-correlated; after the transformation the linear discriminant works perfectly on the de-correlated data.

Linear Discriminant with Quadratic Input
A simple extension of the linear decision boundary to a "quadratic" decision boundary:
 - inputs var0, var1: linear decision boundaries in var0, var1
 - inputs var0*var0, var1*var1, var0*var1: quadratic decision boundaries in var0, var1
[Figure: performance of the Fisher discriminant, the Fisher discriminant with decorrelated variables, and the Fisher discriminant with quadratic input.]

Summary (1)
Event selection is certainly not done best using traditional "cuts" on individual variables.
Multivariate analysis = combination of variables (taking into account variable correlations).
Optimal event selection is based on the likelihood ratio.
kNN and kernel methods are attempts to estimate the probability density in D dimensions.
The "naïve Bayesian" or "projective likelihood" ignores variable correlations; use de-correlation pre-processing of the data if appropriate.
More classifiers: Fisher's linear discriminant -- simple and robust.

Multivariate Analysis Techniques (2)
Helge Voss (MPI-K, Heidelberg)
Terascale Statistics Tool School, DESY, 29.9.-2.10.2008

Outline
Introduction:
 - the classification problem
 - what is Multivariate Data Analysis and Machine Learning
Classifiers:
 - Bayes Optimal Analysis
 - Kernel Methods and Likelihood Estimators
 - Linear Fisher Discriminant
 - Neural Networks
 - Support Vector Machines
 - Boosted Decision Trees
Will not talk about:
 - Unsupervised learning
 - Regression

Neural Networks
Naturally, if we want to go to "arbitrary" nonlinear decision boundaries, y(x) needs to be constructed in "any" nonlinear fashion.
Think of hi(x) as a set of "basis" functions: if sufficiently general (i.e. nonlinear), a linear combination of "enough" basis functions should allow us to describe any possible discriminating function y(x). The Weierstrass theorem proves just that statement.
Imagine you choose the hi(x) such that y(x) is a linear combination of nonlinear functions of linear combinations of the input data.
Now we "only" need to find the appropriate "weights" w.

Neural Networks: Multilayer Perceptron (MLP)
But before talking about the weights, let's try to "interpret" the formula as a neural network.
[Figure: network with an input layer (D discriminating input variables plus 1 offset node), a hidden layer of M nodes with "activation" functions (e.g. sigmoid or tanh), and an output layer.]
Nodes in the hidden layer represent the "activation functions" whose arguments are linear combinations of the input variables => nonlinear response to the input.
The output is a linear combination of the outputs of the activation functions at the internal nodes.
Input to the layers comes from preceding nodes only => feed-forward network (no backward loops).
It is straightforward to extend this to several hidden layers.
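A sketch of the forward pass of such a single-hidden-layer network with sigmoid activations (the array shapes are my own convention):

```python
import numpy as np

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a single-hidden-layer feed-forward network.

    x: (N_events, D) inputs; w_hidden: (D, M); b_hidden: (M,); w_out: (M,); b_out: scalar.
    Hidden nodes apply a sigmoid to linear combinations of the inputs,
    the output is a linear combination of the hidden-node activations.
    """
    a = 1.0 / (1.0 + np.exp(-(x @ w_hidden + b_hidden)))   # sigmoid activations
    return a @ w_out + b_out                                # network output y(x)
```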

Neural Networks: Multilayer Perceptron (MLP)
The same picture in "brain" language: nodes = neurons, links (weights) = synapses.
A neural network tries to simulate the reaction of a brain to a certain stimulus (the input data).

Neural Network Training
Idea: using the "training events", adjust the weights such that y(x) -> 0 for background events and y(x) -> 1 for signal events.
How do we adjust? Minimize a loss function, i.e. the usual "sum of squares" of (predicted y(x) - true event type) over the training events.
y(x) is a very "wiggly" function of the weights with many local minima. A global overall fit in the many parameters is possible but not the most efficient method to train neural networks. Instead:
 - back-propagation (learn from experience, gradually adjust your perception to match reality)
 - online learning (learn event by event -- not only at the end of your life from all experience)

Neural Network Training: Back-Propagation and Online Learning
Start with random weights; adjust the weights in each step a bit in the direction of the steepest descent of the loss function L:
 - for weights connected to output nodes: a simple gradient formula
 - for weights connected to hidden nodes: a bit more complicated formula (chain rule)
Note: all these gradients are easily calculated from the training events.
Training is repeated n times over the whole training data sample. How often?
For online learning, the training events should be mixed randomly, otherwise you first steer in a wrong direction from which it is afterwards hard to get out again!
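A sketch of online back-propagation for a small single-hidden-layer network with squared loss, on a toy two-dimensional sample (learning rate, network size and toy data are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training sample: signal around (+1,+1), background around (-1,-1); targets 1 / 0.
x = np.vstack([rng.normal(+1, 1, size=(500, 2)), rng.normal(-1, 1, size=(500, 2))])
t = np.concatenate([np.ones(500), np.zeros(500)])

D, M, eta = 2, 5, 0.05                   # inputs, hidden nodes, learning rate
w1, b1 = 0.1 * rng.normal(size=(D, M)), np.zeros(M)
w2, b2 = 0.1 * rng.normal(size=M), 0.0

for epoch in range(20):                  # repeat training over the whole sample
    order = rng.permutation(len(x))      # mix events randomly for online learning
    for k in order:
        a = sigmoid(x[k] @ w1 + b1)      # hidden activations
        y = sigmoid(a @ w2 + b2)         # network output
        # gradients of the squared loss (y - t)^2 via the chain rule
        delta_out = 2.0 * (y - t[k]) * y * (1.0 - y)
        delta_hid = delta_out * w2 * a * (1.0 - a)
        w2 -= eta * delta_out * a;  b2 -= eta * delta_out
        w1 -= eta * np.outer(x[k], delta_hid);  b1 -= eta * delta_hid

y_all = sigmoid(sigmoid(x @ w1 + b1) @ w2 + b2)
print("mean output: signal %.2f, background %.2f" % (y_all[:500].mean(), y_all[500:].mean()))
```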

Overtraining
Training is repeated n times over the whole training data sample. How often?
[Figure: two decision boundaries in the (x1, x2) plane; it seems intuitive that the smooth boundary will give better results on another, statistically independent data set than the wiggly one.]
Stop training before you learn statistical fluctuations in the data; verify on an independent "test" sample.
[Figure: classification error vs. training cycles (or any tuning parameter a) for the training and the test sample.]
Possible overtraining is a concern for every "tunable parameter" a of a classifier: smoothing parameter, number of nodes, training cycles, ...

Cross Validation
Many (all) classifiers have tuning parameters "a" that need to be controlled against overtraining:
 - number of training cycles, number of nodes (neural net)
 - smoothing parameter h (kernel density estimator)
 - ...
The more free parameters a classifier has to adjust internally, the more prone it is to overtraining; more training data gives better training results, but dividing the data set into "training" and "test" samples reduces the training data.
Cross validation: divide the data sample into, say, 5 sub-sets.
Train 5 classifiers yi(x, a), i = 1, ..., 5, where classifier yi(x, a) is trained without the i-th sub-sample; calculate the test error CV(a) from the sub-samples not used in training.
Choose the tuning parameter a for which CV(a) is minimal and train the final classifier using all data.
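A sketch of 5-fold cross validation, here tuning the k of a simple kNN classifier on toy data (the classifier and the toy sample are stand-ins for whatever is actually being tuned):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.vstack([rng.normal(+0.7, 1, size=(300, 2)), rng.normal(-0.7, 1, size=(300, 2))])
t = np.concatenate([np.ones(300), np.zeros(300)])

def knn_predict(train_x, train_t, test_x, k):
    """Majority vote of the k nearest training events (Euclidean distance)."""
    d = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (train_t[nearest].mean(axis=1) > 0.5).astype(float)

def cv_error(x, t, k, n_folds=5):
    """n-fold cross validation: each fold is used once as the test sample."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, n_folds)
    errs = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        pred = knn_predict(x[train], t[train], x[test], k)
        errs.append((pred != t[test]).mean())
    return np.mean(errs)

for k in (1, 5, 15, 45):
    print("k = %2d   CV error = %.3f" % (k, cv_error(x, t, k)))
```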

What is the best Network Architecture?
Theoretically a single hidden layer is enough for any problem, provided one allows for a sufficient number of nodes (Weierstrass theorem).
"Relatively little is known concerning advantages and disadvantages of using a single hidden layer with many nodes over many hidden layers with fewer nodes. The mathematics and approximation theory of the MLP model with more than one hidden layer is not very well understood. ... Nonetheless there seems to be reason to conjecture that the two hidden layer model may be significantly more promising than the single hidden layer model."
A. Pinkus, "Approximation theory of the MLP model in neural networks", Acta Numerica (1999), pp. 143-195.
Typically in high-energy physics non-linearities are reasonably simple, so in many cases 1 layer with a larger number of nodes should be enough. (Glen Cowan)

Support Vector Machine
The main difficulty with neural networks is that the loss function has many local minima as a function of the weights.
There are methods that find a linear decision boundary using only quadratic forms, so that the minimisation procedure has a unique minimum.
We have already seen that variable transformations may allow us to implement essentially nonlinear decision boundaries with linear classifiers.
=> Support Vector Machines

Support Vector Machines
Find the hyperplane that best separates signal from background:
 - best separation: maximum distance (margin) between the closest events (support vectors) and the hyperplane
 - linear decision boundary
 - if the data are non-separable, add a misclassification cost parameter C * sum_i xi_i to the minimisation function
 - the solution depends only on inner products of the support vectors => quadratic minimisation problem
[Figure: separable and non-separable data in the (x1, x2) plane, optimal hyperplane, margin and support vectors.]

Support Vector Machines
Nonlinear cases:
 - transform the variables into a higher-dimensional feature space where again a linear boundary (hyperplane) can separate the data
 - explicit basis functions are not required: use kernel functions to approximate the scalar products between the transformed vectors in the higher-dimensional feature space
 - choose a kernel and fit the hyperplane using the linear techniques developed above
 - possible kernels: e.g. Gaussian, polynomial, sigmoid
 - the kernel size parameter typically needs careful tuning!
[Figure: mapping (x1, x2) into a higher-dimensional space (x1, x2, x3) where a hyperplane separates the classes.]
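Not a full SVM fit, just the kernel-trick ingredient: a Gaussian kernel matrix that stands in for the scalar products in the higher-dimensional feature space (the width sigma and the toy events are illustrative):

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel K(a, b) = exp(-|a - b|^2 / (2 sigma^2)).

    It plays the role of the scalar product between the (implicitly) transformed
    vectors in the higher-dimensional feature space; sigma is the kernel size
    parameter that typically needs careful tuning.
    """
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Kernel matrix between all pairs of training events -- the only ingredient the
# quadratic minimisation problem needs; no explicit feature map is ever built.
rng = np.random.default_rng(6)
events = rng.normal(size=(5, 2))
print(gaussian_kernel(events, events, sigma=0.8).round(2))
```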

Boosted Decision Trees
Decision tree: sequential application of cuts splits the data into nodes, where the final nodes (leaves) classify an event as signal or background.

Boosted Decision Trees
Decision tree: sequential application of cuts splits the data into nodes, where the final nodes (leaves) classify an event as signal or background.
 - used for a long time in general "data-mining" applications, less known in HEP (although very similar to "simple cuts")
 - easy to interpret and visualise
 - independent of monotonic variable transformations, immune against outliers
 - weak variables are ignored (and don't deteriorate performance much)
 - disadvantage: very sensitive to statistical fluctuations in the training data
Boosted Decision Trees (1996): combine a forest of decision trees derived from the same sample, e.g. using different event weights.
 - overcomes the stability problem
 - became popular in HEP since MiniBooNE, B. Roe et al., NIM A 543 (2005)

Growing a Decision Tree
Training:
 - start with all training events at the root node
 - split the training sample at each node into two, using a cut in the variable that gives the best separation gain
 - continue splitting until a minimal number of events is reached or the maximum number of nodes
 - leaf nodes classify S, B according to the majority of events, or give an S/B probability
Why not multiple branches (splits) per node? It fragments the data too quickly; also, multiple splits per node = a series of binary node splits.
What about multivariate splits? Time consuming, and other methods are better adapted to such correlations.

Separation Gain
What do we mean by "best separation gain"? Define a measure of how mixed S and B are in a node (p = purity):
 - Gini index: p (1 - p)   (Corrado Gini 1912, typically used to measure income inequality)
 - Cross entropy: -( p ln p + (1 - p) ln(1 - p) )
 - Misidentification: 1 - max(p, 1 - p)
The differences between the various indices are small; most commonly used is the Gini index.
Separation gain: e.g. N_parent * Gini_parent - N_left * Gini_leftNode - N_right * Gini_rightNode.
Choose, amongst all possible variables and cut values, the one that maximises this.
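A sketch of the Gini-based separation gain and the scan over cut values on a single variable (toy data; a real tree would repeat this over all variables and all nodes):

```python
import numpy as np

def gini(labels):
    """Gini index p(1-p) of a node, with p the signal purity."""
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    return p * (1.0 - p)

def best_split(x, labels):
    """Scan all cut values on one variable, return the cut with maximal separation gain.

    Gain = N_parent*Gini_parent - N_left*Gini_left - N_right*Gini_right.
    """
    order = np.argsort(x)
    x, labels = x[order], labels[order]
    parent = len(labels) * gini(labels)
    best_gain, best_cut = -np.inf, None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                       # identical values cannot be separated
        cut = 0.5 * (x[i] + x[i - 1])
        gain = parent - i * gini(labels[:i]) - (len(x) - i) * gini(labels[i:])
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_cut, best_gain

rng = np.random.default_rng(7)
var0 = np.concatenate([rng.normal(+1, 1, 500), rng.normal(-1, 1, 500)])
lab = np.concatenate([np.ones(500), np.zeros(500)])
print(best_split(var0, lab))               # a cut near 0 separates the two classes best
```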

Decision Tree Pruning
One can continue node splitting until all leaf nodes are basically pure (using the training sample) -- obviously, that's overtraining.
Two possibilities:
 - stop growing earlier: generally not a good idea, useless splits might open up subsequent useful splits
 - grow the tree to the end and "cut back" nodes that seem statistically dominated: pruning
E.g. cost-complexity pruning: assign to every sub-tree T a cost C(T, a) built from the loss function plus a regularisation/cost parameter a; find the sub-tree T with minimal C(T, a) for a given a, and prune up to the value of a that does not show overtraining in the test sample.

Decision Tree Pruning
"Real life" example of an optimally pruned decision tree: [Figure: decision tree before and after pruning.]
Pruning algorithms are developed and applied on individual trees; optimally pruned single trees are not necessarily optimal in a forest!

Boosting
[Figure: boosting scheme -- train classifier C(0)(x) on the training sample, re-weight the events, train C(1)(x) on the weighted sample, re-weight again, train C(2)(x), C(3)(x), ..., C(m)(x).]

Adaptive Boosting (AdaBoost)
[Figure: same boosting scheme -- classifiers C(0)(x), C(1)(x), ..., C(m)(x) trained on successively re-weighted samples.]
AdaBoost re-weights the events misclassified by the previous classifier, using that classifier's error rate.
AdaBoost also weights the classifiers themselves in the combined output according to the error rate of the individual classifier.
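A sketch of this re-weighting scheme with single-variable decision stumps as the weak classifier (the toy three-bump sample and all names are illustrative, not the slides' data file); misclassified events are boosted by (1 - err)/err and each tree enters the combination with weight log((1 - err)/err):

```python
import numpy as np

def train_stump(x, t, w):
    """Weighted decision-tree stump on a single variable: returns (err, cut, sign)."""
    best = (np.inf, 0.0, 1)
    for cut in np.unique(x):
        for sign in (+1, -1):
            pred = (sign * (x - cut) > 0).astype(float)
            err = w[pred != t].sum()
            if err < best[0]:
                best = (err, cut, sign)
    return best

def adaboost(x, t, n_trees=10):
    """AdaBoost with stumps: boost misclassified events, weight trees by their error rate."""
    w = np.ones(len(x)) / len(x)
    trees = []
    for _ in range(n_trees):
        err, cut, sign = train_stump(x, t, w)
        err = max(err, 1e-10)
        alpha = np.log((1.0 - err) / err)        # classifier weight
        pred = (sign * (x - cut) > 0).astype(float)
        w[pred != t] *= (1.0 - err) / err        # boost misclassified events
        w /= w.sum()
        trees.append((alpha, cut, sign))
    return trees

def predict(trees, x):
    score = sum(a * (s * (x - c) > 0).astype(float) for a, c, s in trees)
    return score / sum(a for a, _, _ in trees)   # weighted average of the tree responses

# Toy "three bumps": signal in two outer bumps, background in the middle.
rng = np.random.default_rng(8)
x = np.concatenate([rng.normal(+1, .2, 400), rng.normal(-1, .2, 200), rng.normal(0, .2, 600)])
t = np.concatenate([np.ones(600), np.zeros(600)])
forest = adaboost(x, t, n_trees=5)
print((np.round(predict(forest, x)) == t).mean())   # fraction of correctly classified events
```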

Bagging and Randomised Trees
Other classifier combinations:
 - Bagging: combine trees grown from "bootstrap" samples (i.e. re-sample the training data with replacement)
 - Randomised trees (Random Forest: trademark L. Breiman, A. Cutler): combine trees grown with random subsets of the training data only, and random subsets of the variables available at each split; NO pruning!
These combined classifiers work surprisingly well, are very stable, and are almost perfect "out of the box" classifiers.

AdaBoost: A Simple Demonstration
The example (somewhat artificial, but nice for demonstration): a data file with three "bumps".
Weak classifier: one single simple "cut", i.e. a decision tree stump (var(i) > x vs. var(i) <= x).
Two reasonable cuts:
 a) Var0 > 0.5: eps_signal = 66%, eps_bkg ~ 0%, misclassified events in total 16.5%
 b) Var0 < -0.5: eps_signal = 33%, eps_bkg ~ 0%, misclassified events in total 33%
The training of a single decision tree stump will find cut a).

AdaBoost: A Simple Demonstration
The first "tree", choosing cut a), will give an error fraction err = 0.165.
Before building the next "tree", the wrongly classified training events are re-weighted by (1 - err)/err ~ 5, so the next "tree" sees essentially the re-weighted data sample shown in b), and hence will choose cut b): Var0 < -0.5.
The combined classifier Tree1 + Tree2: the (weighted) average of the responses of both trees to a test event is able to separate signal from background as well as one would expect from the most powerful classifier.
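The quoted re-weighting factor of about 5 is simply:

$$ \frac{1 - \mathrm{err}}{\mathrm{err}} = \frac{1 - 0.165}{0.165} \approx 5.1 $$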

AdaBoost: A Simple Demonstration
[Figure: classifier output with only 1 tree "stump" vs. with 2 tree "stumps" combined with AdaBoost.]

AdaBoost vs. Other Combined Classifiers
Often people present "boosting" as nothing else than "smearing" to make the decision trees more stable with respect to statistical fluctuations in the training.
Clever "boosting", however, can do more than, for example, Random Forests or Bagging: in this example, pure statistical fluctuations are not enough to enhance the second peak sufficiently.
For AdaBoost with a "fully grown decision tree", which is much more than a "weak classifier", the "stabilisation" aspect is more important.

Learning with Rule Ensembles
Following the RuleFit approach by Friedman-Popescu (Tech. Rep., Stat. Dept., Stanford U., 2003).
The RuleFit classifier is a linear combination of rules plus a linear Fisher term, where a rule is a sequence of cuts:
 - rm(x): rules (cut sequences; rm = 1 if all cuts are satisfied, = 0 otherwise)
 - xk: normalised discriminating event variables (the linear Fisher term)
The problem to solve:
 - create the rule ensemble from a forest of decision trees
 - fit the coefficients am, bk by gradient directed regularisation, minimising the risk (Friedman et al.)
 - pruning removes topologically equal rules (same variables in the cut sequence)
(Aside shown on the slide: one of the elementary cellular automaton rules (Wolfram 1983, 2002) specifies the next colour of a cell depending on its colour and its immediate neighbours; its rule outcomes are encoded in the binary representation 30 = 00011110_2.)

See Some Classifiers at Work

Toy Examples: Linear, Cross and Circular Correlations
Illustrate the behaviour of linear and nonlinear classifiers on toy data with:
 - linear correlations (same for signal and background)
 - linear correlations (opposite for signal and background)
 - circular correlations (same for signal and background)

Weight Variables by Classifier Output
How well do the classifiers resolve the various correlation patterns?
[Figure: event distributions weighted by classifier output (Fisher, MLP, BDT, Likelihood, Likelihood-D, PDERS) for the linear, cross-linear and circular correlation examples.]

Some Words about Systematic Errors
Typical worries are:
 - What happens if the estimated "probability density" is wrong?
 - Can the classifier, i.e. the discrimination function y(x), introduce systematic uncertainties?
 - What happens if the training data do not match "reality"?
Any wrong PDF leads to an imperfect discrimination function; an imperfect ("wrong") y(x) means a loss of discrimination power -- that's all!
Classical cuts face exactly the same problem; only, in addition to cutting on features that are not correct, you can also cut on correlations that are not there.
Systematic errors are only introduced once "Monte Carlo events" with imperfect modelling are used for the efficiency/purity and the number of expected events -- the same problem as with a classical "cut" analysis.
Use control samples to test the MVA output distribution y(x).

Treatment of Systematic Uncertainties
Is there a strategy to become "less sensitive" to possible systematic uncertainties?
Classically, for a variable that is prone to uncertainties, one does not cut in the region of the steepest gradient; one would not place the most important cut on an uncertain variable.
Try to make the classifier less sensitive to "uncertain variables", i.e. re-weight events in the training to decrease the separation in variables with large systematic uncertainty.
A "calibration uncertainty" may shift the central value and hence worsen (or increase) the discrimination power of "var4".
(Certainly not yet a recipe that can strictly be followed, more an idea of what could perhaps be done.)

Treatment of Systematic Uncertainties: 1st Way
[Figure: classifier output distributions for signal only.]

Treatment of Systematic Uncertainties: 2nd Way
[Figure: classifier output distributions for signal only.]

Summary (2)
Event selection is certainly not done best using traditional "cuts" on individual variables.
Multivariate analysis = combination of variables (taking into account variable correlations).
Optimal event selection is based on the likelihood ratio.
kNN and kernel methods are attempts to estimate the probability density in D dimensions.
The "naïve Bayesian" or "projective likelihood" ignores variable correlations; use de-correlation pre-processing of the data if appropriate.
More classifiers:
 - Fisher's linear discriminant: simple and robust
 - Neural networks: very powerful but difficult to train (multiple local minima)
 - Support vector machines: one global minimum but needs careful tuning
 - Boosted decision trees: a "brute force method"
How to avoid overtraining.
Systematic errors: be as careful as with "cuts" and check against data.

Summary of Classifiers and their Properties
[Table: classifiers (Cuts, Likelihood, PDERS / k-NN, H-Matrix, Fisher, MLP, BDT, RuleFit, SVM) rated against the criteria: performance with no/linear correlations, performance with nonlinear correlations, speed of training, speed of response, robustness against overtraining, robustness against weak input variables, curse of dimensionality, transparency.]
The properties of the function discriminant (FDA) depend on the chosen function.