Introduction to Statistics and Machine Learning


Introduction to Statistics and Machine Learning
How do we understand and interpret our measurements? How do we get the data for our measurements?

Outline
Multivariate classification/regression algorithms (MVA):
- motivation
- another introduction / a repeat of the ideas of hypothesis tests in this context
- Multidimensional Likelihood (kNN: k-Nearest Neighbour)
- Projective Likelihood (naïve Bayes)
- What to do with correlated input variables? Decorrelation strategies

MVA Literature / Software Packages ... a biased selection
Literature:
- T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning", Springer 2001
- C.M. Bishop, "Pattern Recognition and Machine Learning", Springer 2006
Software packages for multivariate data analysis/classification:
- individual classifier software, e.g. "JETNET" (C. Peterson, T. Rognvaldsson, L. Loennblad) and many other packages
- attempts to provide "all inclusive" packages:
  - StatPatternRecognition: I. Narsky, arXiv: physics/0507143, http://www.hep.caltech.edu/~narsky/spr.html
  - TMVA: Höcker, Speckmayer, Stelzer, Therhaag, von Toerne, Voss, arXiv: physics/0703039, http://tmva.sf.net or every ROOT distribution (development moved from SourceForge to the ROOT repository)
  - WEKA: http://www.cs.waikato.ac.nz/ml/weka/
  - "R", a huge data analysis library: http://www.r-project.org/
Conferences: PHYSTAT, ACAT, ...

Event Classification
Suppose a data sample with two types of events, carrying class labels Signal and Background (we restrict ourselves here to the two-class case; many classifiers can in principle be extended to several classes, otherwise analyses can be staged). We have discriminating variables x1, x2, ...
How do we set the decision boundary to select events of type S? Rectangular cuts? A linear boundary? A nonlinear one?
[Figure: three scatter plots of S and B events in the (x1, x2) plane, with rectangular, linear and nonlinear decision boundaries.]
How can we decide what to use? And once we have decided on a class of boundaries, how do we find the "optimal" one? Low-variance (stable), high-bias methods vs. high-variance, small-bias methods.

Regression
How do we estimate a "functional behaviour" from a given set of "known measurements"? Assume for example "D" variables that somehow characterise the shower in your calorimeter → the energy as a function of the calorimeter shower parameters. Is it constant? Linear? Non-linear?
[Figure: f(x) vs. x for constant, linear and non-linear behaviour, e.g. energy vs. cluster size.]
If we had an analytic model (i.e. we knew the function is an nth-order polynomial), then we know how to fit it (e.g. with a maximum likelihood fit). But what if we just want to "draw any kind of curve" and parameterise it?
Seems trivial? The human brain has very good pattern recognition capabilities! But what if you have many input variables?

Regression → model functional behaviour
Assume for example "D" variables that somehow characterise the shower in your calorimeter. Monte Carlo or testbeam → a data sample with measured cluster observables plus known particle energy = calibration function (the energy is a surface in D+1 dimensional space).
[Figure: 1-D example, f(x) vs. x with events generated according to the underlying distribution; 2-D example, surface f(x,y) over the (x,y) plane.]
Better known as (linear) regression: fit a known analytic function. For the above 2-D example a reasonable function would be f(x,y) = ax² + by² + c, as sketched in code below.
What if we don't have a reasonable "model"? Then we need something more general, e.g. piecewise defined splines, kernel estimators or decision trees to approximate f(x) → NOT in order to "fit a parameter", but to provide a prediction of the function value f(x) for new measurements x (where f(x) is not known).
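As an illustration of the parametric case named above, here is a minimal numpy sketch (toy data; the coefficients and noise level are invented) that fits f(x, y) = ax² + by² + c by linear least squares:

```python
import numpy as np

# Minimal sketch: fit the parametric model f(x, y) = a*x^2 + b*y^2 + c
# by linear least squares (toy "known measurements", invented coefficients).
rng = np.random.default_rng(0)
x, y = rng.uniform(-1, 1, 200), rng.uniform(-1, 1, 200)
f = 2.0 * x**2 + 0.5 * y**2 + 1.0 + rng.normal(0, 0.05, 200)

# Design matrix for the three free parameters a, b, c.
A = np.column_stack([x**2, y**2, np.ones_like(x)])
(a, b, c), *_ = np.linalg.lstsq(A, f, rcond=None)
print(f"fitted a={a:.2f}, b={b:.2f}, c={c:.2f}")   # close to 2.0, 0.5, 1.0
```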

Event Classification
Each event, whether Signal or Background, has "D" measured variables. Find a mapping from the D-dimensional input-observable ("feature") space to a one-dimensional output → the class label.
y(x): R^D → R is the most general form: y = y(x) with x = {x1, ..., xD} the input variables.
Plotting (histogramming) the resulting y(x) values gives the distributions of the classifier output.
Who sees how this would look for the regression problem?

Event Classification
Each event, whether Signal or Background, has "D" measured variables. Find a mapping from the D-dimensional input/observable/"feature" space to a one-dimensional output → class labels, e.g. y(B) → 0, y(S) → 1.
y(x): R^D → R is a "test statistic" in the D-dimensional space of input variables. The distributions of y(x), PDF_S(y) and PDF_B(y), are used to set the selection cut: y > cut: signal; y = cut: decision boundary; y < cut: background. The cut determines efficiency and purity.
y(x) = const is the surface defining the decision boundary. The overlap of PDF_S(y) and PDF_B(y) determines the separation power and the purity.

Classification ↔ Regression
Classification: each event, whether Signal or Background, has "D" measured variables. y(x): R^D → R is a "test statistic" in the D-dimensional space of input variables, and y(x) = const is the surface defining the decision boundary.
Regression: each event has "D" measured variables plus one function value (e.g. cluster shape variables in the ECAL plus the particle's energy). y(x): R^D → R, and y(x) = const now gives the hypersurfaces where the target function is constant. Here y(x) needs to be built such that it best approximates the target, not such that it best separates signal from background.
[Figure: surface f(x1, x2) over the (x1, x2) plane.]

Event Classification
y(x): R^D → R is the mapping from the "feature space" (observables) to one output variable. PDF_B(y) and PDF_S(y) are the normalised distributions of y = y(x) for background and signal events (i.e. the "functions" that describe the shapes of the distributions). With y = y(x) one can also write PDF_B(y(x)) and PDF_S(y(x)): probability densities for background and signal.
Now assume we have an unknown event from the example above for which y(x) = 0.2, with PDF_B(y(x)) = 1.5 and PDF_S(y(x)) = 0.45. Let f_S and f_B be the fractions of signal and background events in the sample; then

    P(S | y(x)) = f_S · PDF_S(y(x)) / ( f_S · PDF_S(y(x)) + f_B · PDF_B(y(x)) )

is the probability that an event with measured x = {x1, ..., xD}, which gives y(x), is of type signal.
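A worked version of this number example; the signal and background fractions f_S and f_B below are invented for illustration:

```python
# Numbers from the slide: at y(x) = 0.2 the densities are
# PDF_S(y) = 0.45 and PDF_B(y) = 1.5.  The fractions f_S, f_B are assumed.
pdf_s, pdf_b = 0.45, 1.5
f_s, f_b = 0.2, 0.8          # assumed abundances in the sample
p_signal = f_s * pdf_s / (f_s * pdf_s + f_b * pdf_b)
print(f"P(S | y(x)=0.2) = {p_signal:.3f}")   # ~0.07 for these fractions
```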

Event Classification
P(Class = C | x), or simply P(C | x), is the probability that the event class is C, given the measured observables x = {x1, ..., xD} → y(x). Bayes' theorem writes this posterior probability as

    P(C | y(x)) = P(y(x) | C) · P(C) / P(y(x)) ,

where P(y(x) | C) is the probability density according to the measurements x and the given mapping function, P(C) is the prior probability to observe an event of class C (i.e. the relative abundance of "signal" versus "background"), and P(y(x)) is the overall probability density to observe the actual measurement y(x), i.e. P(y(x)) = Σ_C P(y(x) | C) · P(C).

Any Decision Involves Risk!
Decide to treat an event as "Signal" or "Background". Trying to select signal events means trying to disprove the null hypothesis stating it were "only" a background event.
Type-1 error (false positive): classify an event as class C even though it is not, i.e. accept a hypothesis although it is not true (reject the null hypothesis although it would have been the correct one) → loss of purity (accepting wrong events).
Type-2 error (false negative): fail to identify an event from class C as such, i.e. reject a hypothesis although it would have been correct/true (fail to reject / accept the null hypothesis although it is false) → loss of efficiency (missing true signal events).

 accept as \ truly is |  Signal       |  Background
 Signal               |  correct      |  Type-1 error
 Background           |  Type-2 error |  correct

"A": region of the outcome of the test where you accept the event as Signal.
Significance α (type-1 error rate): α = background selection "efficiency"; it should be small.
Size β (type-2 error rate): how often you miss the signal; it should be small. Power 1 − β = signal selection efficiency.
Most of the rest of the lecture will be about methods that try to make as few mistakes as possible.
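A small sketch of how α and β follow from a cut on the classifier output; the Gaussian shapes, means and widths of PDF_S(y) and PDF_B(y) below are purely illustrative assumptions:

```python
from scipy.stats import norm

# Accept an event as signal if y > cut, and read off the two error rates
# (assumed Gaussian shapes for PDF_B(y) and PDF_S(y)).
cut = 0.5
alpha = norm.sf(cut, loc=0.0, scale=0.3)   # type-1 error: background passing the cut
beta = norm.cdf(cut, loc=1.0, scale=0.3)   # type-2 error: signal failing the cut
print(f"alpha = {alpha:.3f}, power 1-beta = {1 - beta:.3f} (signal efficiency)")
```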

Neyman-Pearson Lemma
Neyman-Pearson: the likelihood ratio

    y(x) = P(x | S) / P(x | B) ,

used as "selection criterion", gives for each selection efficiency the best possible background rejection, i.e. it maximises the area under the "Receiver Operating Characteristic" (ROC) curve.
[Figure: ROC curve, 1 − ε_backgr. vs. ε_signal; the likelihood ratio sets the "limit", other classifiers y'(x), y''(x) lie below it; the diagonal corresponds to random guessing.]
Varying the cut y(x) > "cut" moves the working point (efficiency and purity) along the ROC curve (see the sketch after this list). How to choose the "cut"? One needs to know the prior probabilities (S, B abundances):
- measurement of a signal cross section: maximum of S/√(S+B), or equivalently √(ε·p)
- discovery of a signal (typically S << B): maximum of S/√B
- precision measurement: high purity (p) → large background rejection
- trigger selection: high efficiency (ε)
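A minimal sketch of how such a ROC curve is traced out by scanning the cut, using invented 1-D Gaussian toy samples for the classifier output y(x):

```python
import numpy as np

# Scan a cut on the classifier output y(x) and record signal efficiency vs.
# background rejection (toy Gaussian samples, invented means and widths).
rng = np.random.default_rng(1)
y_sig = rng.normal(1.0, 0.5, 10000)   # y(x) for signal events
y_bkg = rng.normal(0.0, 0.5, 10000)   # y(x) for background events

cuts = np.linspace(-2, 3, 200)
eff_sig = [(y_sig > c).mean() for c in cuts]          # signal efficiency
rej_bkg = [1.0 - (y_bkg > c).mean() for c in cuts]    # 1 - background efficiency
# Each (eff_sig, rej_bkg) pair is one working point; the area under this curve
# is what the Neyman-Pearson likelihood ratio maximises.
```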

MVA and Machine Learning
The previous slides were basically the idea of "Multivariate Analysis" (MVA). Remark: what about "standard cuts" (event rejection in each variable separately with fixed conditions, i.e. if x1 > 0 or x2 < 3 then background)?
Finding y(x): R^D → R for a certain type of model class y(x) in an automatic way, using "known" or "previously solved" events, i.e. learning from known "patterns" such that the resulting y(x) has good generalisation properties when applied to "unknown" events (for regression: fits the target function well "in between" the known training events), is what the "machine" is supposed to be doing: supervised machine learning.
Of course there is no magic; we still need to:
- choose the discriminating variables
- choose the class of models (linear, non-linear, flexible or less flexible)
- tune the "learning parameters" → bias vs. variance trade-off
- check the generalisation properties
- consider the trade-off between statistical and systematic uncertainties

Event Classification
Unfortunately, the true probability density functions are typically unknown → the Neyman-Pearson lemma doesn't really help us directly. What we have is Monte Carlo simulation or, in general, a set of known (already classified) "events". There are two different ways to use these "training" events:
- estimate the functional form of p(x|C) (e.g. the differential cross section folded with the detector influences), from which the likelihood ratio can be obtained, e.g. D-dimensional histograms, kernel density estimators, ...
- find a "discrimination function" y(x) and corresponding decision boundary (i.e. a hyperplane* in the "feature space": y(x) = const) that optimally separates signal from background, e.g. linear discriminants, neural networks, ...
* A hyperplane in the strict sense goes through the origin; here I mean "affine set" to be precise.

Unsupervised Learning
Just a short remark, as we talked about "supervised" learning before:
- supervised: training with "events" for which we know the outcome (i.e. Signal or Background)
- unsupervised: no prior knowledge about what is "Signal" or "Background", or we don't even know whether there are different "event classes" at all. Then one can for example do:
  - cluster analysis: if different "groups" are found → class labels
  - principal component analysis: find a basis in observable space with the biggest hierarchical differences in the variance → infer something about the underlying substructure
Examples:
- think about "success" or "no success" rather than "signal" and "background" (i.e. a robot achieves its goal or does not / falls or does not fall / ...)
- market survey: if asked many different questions, maybe you can find "clusters" of people, group them together and test whether there are correlations between these groups and their tendency to buy a certain product → address them specially
- medical survey: group people together and perhaps find common causes for certain diseases

Nearest Neighbour and Kernel Density Estimator
Estimate the probability density P(x) in D-dimensional space from "events" distributed according to P(x): the only thing at our disposal is our "training data".
[Figure: training events in the (x1, x2) plane; a rectangular volume V of edge length h around the point "x".]
Say we want to know P(x) at "this" point x. In a volume V around x one expects to find N·∫_V P(x)dx events from a dataset with N events. For a chosen rectangular volume one counts the K events inside,

    K = Σ_{i=1}^{N} k( (x − x_i) / h ) ,

where k(u) is called a kernel function. K (from the "training data") gives an estimate of the average P(x) in the volume V, ∫_V P(x)dx ≈ K/N, hence

    P(x) ≈ K / (N · V)

is the kernel density estimator of the probability density.
Classification: determine PDF_S(x) and PDF_B(x) → likelihood ratio as classifier!
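A minimal sketch of this rectangular-window estimate and of its use as a likelihood-ratio classifier; the 2-D Gaussian toy samples and the window size h are invented stand-ins for real signal and background training data:

```python
import numpy as np

def parzen_density(x, data, h):
    """Rectangular Parzen-window estimate of P(x): count training events inside
    a D-dimensional box of edge length h around x, divide by N and the volume."""
    data = np.atleast_2d(data)
    n, d = data.shape
    inside = np.all(np.abs(data - x) < h / 2.0, axis=1)  # kernel k(u)=1 inside the box
    return inside.sum() / (n * h**d)

# Estimate PDF_S(x) and PDF_B(x) separately from toy training samples,
# then use their ratio as classifier.
rng = np.random.default_rng(2)
sig_train = rng.normal([1, 1], 0.5, size=(500, 2))
bkg_train = rng.normal([0, 0], 0.5, size=(500, 2))
x = np.array([0.8, 0.9])
ratio = parzen_density(x, sig_train, h=0.5) / max(parzen_density(x, bkg_train, h=0.5), 1e-12)
print(f"likelihood ratio at x = {ratio:.2f}")
```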

Nearest Neighbour and Kernel Density Estimator
Same setup as before: count the K training events inside a rectangular volume V of size h around the point x, with k(u) the kernel function, here a rectangular "Parzen window", and estimate the average P(x) in the volume via ∫_V P(x)dx ≈ K/N.
Regression: if each event with coordinates (x1, x2) carries a "function value" f(x1, x2) (e.g. the energy of the incident particle), then averaging these function values over the events in the volume,

    f(x) ≈ (1/K) Σ_{x_i in V} f(x_i) ,

gives the estimate of the target at x, i.e. the average function value.
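A sketch of this regression variant under the same rectangular-window assumption; the "calibration sample" below (observables and energies) is invented:

```python
import numpy as np

def window_regression(x, train_x, train_f, h):
    """Average the known function values f(x_i) of all training events that
    fall inside the window of size h around x (local average = estimate of f(x))."""
    train_x = np.atleast_2d(train_x)
    inside = np.all(np.abs(train_x - x) < h / 2.0, axis=1)
    if not inside.any():
        return np.nan                      # no training event nearby
    return train_f[inside].mean()

# Toy calibration sample: cluster observables -> particle energy.
rng = np.random.default_rng(3)
obs = rng.uniform(0, 1, size=(1000, 2))
energy = 10.0 * obs[:, 0] + 5.0 * obs[:, 1] ** 2 + rng.normal(0, 0.1, 1000)
print(window_regression(np.array([0.5, 0.5]), obs, energy, h=0.2))
```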

Nearest Neighbour and Kernel Density Estimator
Alternatively, determine K directly from the "training data" with signal and background mixed together: kNN (k-Nearest Neighbours) uses the relative number of events of the various classes amongst the k nearest neighbours of the test point as classifier output.
[Figure: mixed signal and background training events in the (x1, x2) plane, with the nearest neighbours of a test point "x" highlighted.]
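A brute-force kNN sketch along these lines (toy 2-D Gaussian training samples; no kd-tree speed-up, see the remark two slides further on):

```python
import numpy as np

def knn_signal_fraction(x, sig_train, bkg_train, k=20):
    """Mix signal and background training events, take the k closest ones to x,
    and return the signal fraction among them as classifier output y(x)."""
    data = np.vstack([sig_train, bkg_train])
    labels = np.concatenate([np.ones(len(sig_train)), np.zeros(len(bkg_train))])
    dist = np.linalg.norm(data - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return labels[nearest].mean()          # y(x) in [0, 1]

rng = np.random.default_rng(4)
sig = rng.normal([1, 1], 0.5, size=(500, 2))
bkg = rng.normal([0, 0], 0.5, size=(500, 2))
print(knn_signal_fraction(np.array([0.8, 0.9]), sig, bkg, k=20))
```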

Kernel Density Estimator
Parzen window ("rectangular kernel") → discontinuities at the window edges. A smoother model for P(x) is obtained with smooth kernel functions, e.g. a Gaussian: place a "Gaussian" around each training data point and sum up their contributions at arbitrary points x → P(x).
[Figure: individual kernels and the averaged kernels forming the density estimate.]
h, the "size" of the kernel, is the "smoothing parameter". There is a large variety of possible kernel functions.
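A minimal 1-D sketch of the Gaussian-kernel version (toy training sample; the choice of h is arbitrary):

```python
import numpy as np

def gaussian_kde_1d(x, data, h):
    """Place a Gaussian of width h around each training point and sum the
    contributions at x: the smooth-kernel version of the Parzen estimate."""
    u = (x - data[:, None]) / h                      # shape (N_train, N_eval)
    kernels = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=0) / h                  # average of normalised kernels

rng = np.random.default_rng(5)
train = rng.normal(0.0, 1.0, 500)
grid = np.linspace(-4, 4, 9)
print(gaussian_kde_1d(grid, train, h=0.3))           # estimated P(x) on the grid
```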

Kernel Density Estimator
A general probability density estimator using a kernel K,

    P(x) ≈ (1/N) Σ_{i=1}^{N} (1/h^D) K( (x − x_i) / h ) ,

with h, the "size" of the kernel, as the "smoothing parameter". The chosen size of the smoothing parameter is more important than the kernel function itself: h too small → overtraining; h too large → not sensitive to features in P(x).
Which metric for the kernel (window)? Normalise all variables to the same range; include correlations? Mahalanobis metric: x·x → xᵀV⁻¹x (Christopher M. Bishop).
A drawback of kernel density estimators: the evaluation for any test event involves ALL the training data → typically very time consuming. Binary search trees (i.e. kd-trees) are typically used in kNN methods to speed up the searching.

"Curse of Dimensionality"
Bellman, R. (1961), Adaptive Control Processes: A Guided Tour, Princeton University Press.
We all know: filling a D-dimensional histogram to get a mapping of the PDF is typically unfeasible due to the lack of Monte Carlo events.
Shortcoming of nearest-neighbour strategies: in higher-dimensional classification/regression cases, the idea of looking at "training events" in a reasonably small "vicinity" of the space point to be classified becomes difficult. Consider a total phase-space volume V = 1^D; a cube capturing a particular fraction of that volume needs an edge length of fraction^(1/D). In 10 dimensions, in order to capture 1% of the phase space, 63% of the range in each variable is necessary → that's not "local" anymore.
Therefore we still need to develop all the alternative classification/regression techniques.
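The edge-length calculation behind the 63% number, as a two-line check:

```python
# Edge length needed to capture a given fraction of a unit hypercube:
# edge = fraction ** (1/D).  For D = 10 and 1% of the volume this is ~0.63,
# i.e. 63% of the range of each variable.
for d in (1, 2, 3, 10):
    print(d, round(0.01 ** (1.0 / d), 2))
```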

Naïve Bayesian Classifier, often called "(projective) likelihood"
The multivariate likelihood (k-Nearest Neighbour) estimates the full D-dimensional joint probability density. If the correlations between the variables are weak, the joint PDF is well approximated by the product of the marginal PDFs (1-dim "histograms") of the discriminating variables,

    p(x | class) ≈ Π_{k=1}^{D} p_k(x_k | class) ,   classes: signal and background types,

and the likelihood ratio for an event is built from these per-variable event PDFs, e.g. y(x) = L_S(x) / ( L_S(x) + L_B(x) ) with L_C(x) = Π_k p_k(x_k | C).
One of the first and still very popular MVA algorithms in HEP. There are no hard cuts on individual variables; it allows for some "fuzziness": one very signal-like variable may counterweigh another, less signal-like variable. It is the optimal method if the correlations are zero (Neyman-Pearson lemma). The PDE introduces fuzzy logic.
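A sketch of such a projective likelihood with histogram-based marginal PDFs; the toy Gaussian samples, the binning and ranges, and the y(x) = L_S/(L_S+L_B) combination are illustrative choices, not any particular package's exact implementation:

```python
import numpy as np

def marginal_pdfs(train, bins, ranges):
    """Estimate the 1-dim marginal PDFs by histogramming each input variable."""
    return [np.histogram(train[:, i], bins=bins, range=ranges[i], density=True)
            for i in range(train.shape[1])]

def projective_likelihood(x, pdfs):
    """Product of the marginal PDF values at x (the naive-Bayes assumption)."""
    out = 1.0
    for xi, (hist, edges) in zip(x, pdfs):
        idx = np.clip(np.searchsorted(edges, xi) - 1, 0, len(hist) - 1)
        out *= hist[idx]
    return out

rng = np.random.default_rng(6)
sig = rng.normal([1, 1], 0.5, size=(2000, 2))
bkg = rng.normal([0, 0], 0.5, size=(2000, 2))
ranges = [(-2, 3), (-2, 3)]
pdf_s, pdf_b = marginal_pdfs(sig, 40, ranges), marginal_pdfs(bkg, 40, ranges)
x = np.array([0.8, 0.9])
ls, lb = projective_likelihood(x, pdf_s), projective_likelihood(x, pdf_b)
print(f"y(x) = {ls / (ls + lb):.3f}")
```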

Naïve Bayesian Classifier, often called "(projective) likelihood"
How to parameterise the 1-dim PDFs?
- event counting (histogramming): automatic and unbiased, but suboptimal
- parametric (function) fitting: difficult to automate for arbitrary PDFs
- nonparametric fitting (i.e. splines, kernels): easy to automate, but can create artefacts or suppress information
Example: the original (underlying) distribution is Gaussian.
If the correlations between the variables are really negligible, this classifier is "perfect" (simple, robust, performing). If not, you seriously lose performance → how can we "fix" this?

What if there are correlations?
Typically correlations are present: C_ij = cov[x_i, x_j] = E[x_i x_j] − E[x_i]E[x_j] ≠ 0 (i ≠ j).
→ Pre-processing: choose a set of linearly transformed input variables for which C_ij = 0 (i ≠ j).

Decorrelation
Find a variable transformation that diagonalises the covariance matrix.
Determine the square root C′ of the covariance matrix C, i.e. C = C′C′, by diagonalising C:

    C = V D Vᵀ   →   C′ = V √D Vᵀ .

The transformation from the original variables x into the decorrelated variable space x′ is then

    x′ = C′⁻¹ x .

Attention: this eliminates only linear correlations!
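A numerical sketch of this square-root decorrelation, on toy linearly correlated Gaussians with an invented correlation coefficient:

```python
import numpy as np

def decorrelate(data):
    """Compute C' with C = C'C' by diagonalising the covariance matrix,
    then transform x -> C'^{-1} x."""
    cov = np.cov(data, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)                   # C = V diag(lambda) V^T
    sqrt_c = eigvec @ np.diag(np.sqrt(eigval)) @ eigvec.T  # symmetric square root C'
    return data @ np.linalg.inv(sqrt_c).T                  # decorrelated variables

rng = np.random.default_rng(7)
raw = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=5000)
print(np.cov(decorrelate(raw), rowvar=False).round(2))     # ~ identity matrix
```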

Decorrelation: Principal Component Analysis
PCA (an unsupervised learning algorithm): reduce the dimensionality of a problem, find the most dominant features in a distribution.
The eigenvectors of the covariance matrix are the "axes" in the transformed variable space; a large eigenvalue means a large variance along that axis (principal component). Sort the eigenvectors according to their eigenvalues and transform the dataset accordingly → a diagonalised covariance matrix, with the first "variable" being the variable with the largest variance.
The principal component (PC) of variable k is obtained by subtracting the sample means and projecting onto the eigenvectors,

    x_k^PC = Σ_l ( x_l − x̄_l ) v_l^(k) ,

where v^(k) is the k-th eigenvector. The matrix of eigenvectors V obeys the relation C V = V D (D diagonal) → PCA eliminates correlations!
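A corresponding PCA sketch on the same kind of toy data (invented correlation):

```python
import numpy as np

def pca_transform(data):
    """Diagonalise the covariance matrix, sort eigenvectors by eigenvalue,
    and project the (mean-subtracted) data onto them."""
    centred = data - data.mean(axis=0)                 # subtract sample means
    eigval, eigvec = np.linalg.eigh(np.cov(centred, rowvar=False))
    order = np.argsort(eigval)[::-1]                   # largest variance first
    return centred @ eigvec[:, order], eigval[order]

rng = np.random.default_rng(8)
raw = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=5000)
pcs, variances = pca_transform(raw)
print(variances.round(2))                              # variance along each PC
print(np.cov(pcs, rowvar=False).round(2))              # diagonal covariance
```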

How to Apply the Pre-Processing Transformation?
The correlations, and hence the decorrelation transformation, are different for signal and background variables, but we don't know beforehand whether an event is signal or background. What do we do?
- For the likelihood ratio, decorrelate signal and background independently: apply the signal transformation in the signal likelihood and the background transformation in the background likelihood.
- For other estimators, one needs to decide on one of the two (or decorrelate on a mixture of signal and background events).

Decorrelation at Work
Example: linearly correlated Gaussians → the decorrelation works to 100%, and the 1-D likelihood on the decorrelated sample gives the best possible performance. Compare also the effect on the MVA output variable!
[Figure: MVA output for the correlated variables and after decorrelation (note the different scale on the y-axis).]

Limitations of the Decorrelation
In cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care. How does linear decorrelation affect cases where the correlations between signal and background differ?
[Figure: original correlations, signal and background.]

Limitations of the Decorrelation
In cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care. How does linear decorrelation affect cases where the correlations between signal and background differ?
[Figure: after SQRT decorrelation, signal and background.]

Limitations of the Decorrelation
In cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care. How does linear decorrelation affect strongly nonlinear cases?
[Figure: original correlations, signal and background.]

Limitations of the Decorrelation
In cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care. How does linear decorrelation affect strongly nonlinear cases?
[Figure: after SQRT decorrelation, signal and background.]
Watch out before you use decorrelation "blindly"! Perhaps "decorrelate" only a subspace!

"Gaussian-isation"
Improve the decorrelation by pre-Gaussianisation of the variables.
First: a transformation to achieve a uniform (flat) distribution, the "rarity" transform of variable k,

    x_k → x_k^flat = ∫_{−∞}^{x_k} p_k(x′) dx′ ,

where p_k is the PDF of variable k and x_k the measured value. The integral can be solved in an unbinned way by event counting, or by creating non-parametric PDFs (see the likelihood section later).
Second: make it Gaussian via the inverse error function,

    x_k^Gauss = √2 · erf⁻¹( 2 x_k^flat − 1 ) .

Third: decorrelate (and "iterate" this procedure).
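A sketch of the first two steps (rank-based flattening by event counting, then the inverse error function), using scipy's erfinv and an invented skewed toy distribution:

```python
import numpy as np
from scipy.special import erfinv

def gaussianise(values):
    """Rank-based 'rarity' transform to a flat distribution in (0, 1),
    then inverse error function to a standard Gaussian shape."""
    ranks = np.argsort(np.argsort(values))                 # event counting
    flat = (ranks + 0.5) / len(values)                     # uniform in (0, 1)
    return np.sqrt(2.0) * erfinv(2.0 * flat - 1.0)         # Gaussian-shaped

rng = np.random.default_rng(9)
skewed = rng.exponential(1.0, 10000)                       # toy non-Gaussian input
g = gaussianise(skewed)
print(g.mean().round(2), g.std().round(2))                 # ~0 and ~1
```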

"Gaussian-isation"
[Figure: variable correlations for the original sample, the signal-Gaussianised sample and the background-Gaussianised sample.]
We cannot simultaneously "Gaussianise" both signal and background!

Summary
I hope you are all convinced that multivariate algorithms are nice and powerful classification techniques:
- do not use hard selection criteria (cuts) on each individual observable
- look at all observables "together", e.g. by combining them into one variable
- multidimensional likelihood → PDF in D dimensions
- projective likelihood (naïve Bayes) → PDF in D times 1 dimension
- how to "avoid" correlations

Helge Voss, Introduction to Statistics and Machine Learning, GSI Power Week, Dec 5-9 2011