Multivariate Methods of Data Analysis
Helge Voss, MAX-PLANCK-INSTITUT FÜR KERNPHYSIK IN HEIDELBERG
Advanced Scientific Computing Workshop, ETH 2014

Overview

- Multivariate classification/regression algorithms (MVA)
  - what they are
  - how they work
- Overview of some classifiers
  - multidimensional likelihood (kNN: k-nearest neighbour)
  - projective likelihood (naïve Bayes)
  - linear classifiers
  - non-linear classifiers: neural networks, boosted decision trees, support vector machines
- General comments about
  - overtraining
  - systematic errors

Event Classification

- Discriminate signal from background: we have discriminating variables x1, x2, ...; how do we set the decision boundary to select events of type S?
- Rectangular cuts? A linear boundary? A nonlinear one?
- Which model/class of boundary? Pros and cons?
  - simple boundaries (cuts, linear): low variance (stable), high bias
  - flexible nonlinear boundaries: high variance, small bias
- Once we have decided on a class of boundaries, how do we find the "optimal" one?
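A minimal sketch (my own illustration, not from the slides, assuming NumPy and scikit-learn are available) contrasting the three classes of boundary on invented toy data; all names, distributions and cut values are assumptions made purely for illustration.

```python
# Compare rectangular cuts, a linear boundary and a nonlinear boundary on toy 2-D data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
n = 5000
sig = rng.multivariate_normal([1.0, 1.0], [[1.0, 0.6], [0.6, 1.0]], n)
bkg = rng.multivariate_normal([-1.0, -1.0], [[1.0, -0.4], [-0.4, 1.0]], n)
X = np.vstack([sig, bkg])
labels = np.concatenate([np.ones(n), np.zeros(n)])

# 1) Rectangular cuts: accept events with x1 > 0 and x2 > 0 (low variance, high bias).
cut_pred = (X[:, 0] > 0.0) & (X[:, 1] > 0.0)

# 2) Linear boundary: Fisher-type linear discriminant.
lin_pred = LinearDiscriminantAnalysis().fit(X, labels).predict(X) == 1

# 3) Nonlinear boundary: boosted decision trees (small bias, higher variance).
bdt_pred = GradientBoostingClassifier(n_estimators=200, max_depth=3).fit(X, labels).predict(X) == 1

for name, pred in [("cuts", cut_pred), ("linear", lin_pred), ("BDT", bdt_pred)]:
    eff = pred[labels == 1].mean()          # signal efficiency
    rej = 1.0 - pred[labels == 0].mean()    # background rejection
    print(f"{name:6s}  signal eff = {eff:.2f}   background rejection = {rej:.2f}")
```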

Regression

- Estimate "functional behaviour" from a set of known measurements (constant? linear? non-linear?).
  - e.g. photon energy as a function of D variables: ECAL shower parameters (cluster size, energy, ...)
- Seems trivial? The human brain has very good pattern recognition capabilities!
- Seems trivial? What if you have many input variables?
- Known analytic model (e.g. an nth-order polynomial) → maximum likelihood fit.
- No model? "Draw any kind of curve" and parameterize it?
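As a hedged illustration (my own, not on the slide): with a known analytic model, regression is just a fit of that model's parameters, e.g. a least-squares polynomial fit in NumPy; the toy data below are invented.

```python
# Fit a known analytic model (2nd-order polynomial) to noisy toy measurements.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
f_true = 0.5 * x**2 - 1.2 * x + 3.0                  # underlying functional behaviour
y = f_true + rng.normal(scale=0.5, size=x.size)      # "known measurements" with noise

coeffs = np.polyfit(x, y, deg=2)                     # least-squares fit of the polynomial parameters
print("fitted coefficients :", coeffs)               # close to [0.5, -1.2, 3.0]
print("prediction at x=2.5 :", np.polyval(coeffs, 2.5))
```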

Regression = model functional behaviour

- Estimate the "functional value" from measured parameters.
  - 1-D example: f(x) vs x; 2-D example: f(x, y) vs (x, y), with events generated according to an underlying distribution.
- Better known: (linear) regression → fit a known analytic function.
  - e.g. for the 2-D example above, a reasonable function would be f(x, y) = ax² + by² + c
- Don't have a reasonable "model"? Then you need something more general:
  - e.g. piecewise-defined splines, kernel estimators, decision trees to approximate f(x)
- The aim is NOT to "fit a parameter", but to provide a prediction of the function value f(x) for new measurements x (where f(x) is not known).
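A sketch of the model-free route (my own, assuming scikit-learn is available): a decision-tree regressor approximates f(x, y) = ax² + by² + c directly from sampled events, without being told the analytic form; parameters and sample sizes are invented.

```python
# Non-parametric regression: approximate f(x, y) without assuming its analytic form.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
a, b, c = 2.0, -1.0, 0.5                                  # "unknown" true parameters
X = rng.uniform(-2.0, 2.0, size=(2000, 2))                # measured (x, y) of each event
target = a * X[:, 0]**2 + b * X[:, 1]**2 + c + rng.normal(scale=0.1, size=2000)

tree = DecisionTreeRegressor(max_depth=8).fit(X, target)  # piecewise-constant approximation of f

# Prediction of the function value for a new measurement where f(x, y) is not known:
x_new = np.array([[1.0, 0.5]])
print("tree prediction:", tree.predict(x_new)[0])
print("true value     :", a * 1.0**2 + b * 0.5**2 + c)
```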

Event Classification

- Each event, whether signal or background, has D measured variables → a D-dimensional "feature space".
- Find a mapping from the D-dimensional input/observable ("feature") space to a one-dimensional output → the class label.
  - most general form: y = y(x) with x = {x1, ..., xD} the input variables and y(x): R^D → R
- Plotting (histogramming) the resulting y(x) values shows how signal and background populate this single output variable.

Event Classification

- Each event, whether signal or background, has D measured variables → a D-dimensional "feature space".
- Find a mapping from the D-dimensional input/observable ("feature") space to a one-dimensional output → the class labels.
- y(x): R^D → R is a "test statistic" in the D-dimensional space of input variables, constructed such that y(B) → 0 and y(S) → 1.
- The distributions of y(x), PDF_S(y) and PDF_B(y), are used to set the selection cut:
  - y(x) > cut: signal; y(x) = cut: decision boundary; y(x) < cut: background
  - y(x) = const is the surface defining the decision boundary in feature space
  - the overlap of PDF_S(y) and PDF_B(y) determines the separation power; the chosen cut fixes efficiency and purity
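A hedged numerical sketch (not on the slide) of how a cut on the test statistic translates into signal efficiency and purity; the Gaussian y(x) distributions, sample sizes and the cut value are assumptions for illustration only.

```python
# Toy y(x) distributions for signal (peaking near 1) and background (near 0),
# then signal efficiency and purity for a chosen cut.
import numpy as np

rng = np.random.default_rng(7)
y_sig = rng.normal(loc=0.8, scale=0.15, size=10_000)   # y(S) -> 1
y_bkg = rng.normal(loc=0.2, scale=0.15, size=50_000)   # y(B) -> 0

cut = 0.5
n_sig_sel = np.sum(y_sig > cut)
n_bkg_sel = np.sum(y_bkg > cut)

efficiency = n_sig_sel / y_sig.size                    # fraction of signal kept
purity = n_sig_sel / (n_sig_sel + n_bkg_sel)           # signal fraction in the selected sample
print(f"cut = {cut}: signal efficiency = {efficiency:.3f}, purity = {purity:.3f}")
```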

Classification ↔ Regression

Classification:
- Each event, whether signal or background, has D measured variables → a D-dimensional "feature space".
- y(x): R^D → R: "test statistic" in the D-dimensional space of input variables.
- y(x) = const: surface defining the decision boundary.

Regression:
- Each event has D measured variables plus one function value (e.g. cluster shape variables in the ECAL plus the particle's energy).
- y(x): R^D → R: "regression function".
- y(x) = const: surfaces where the target function is constant.
- Now y(x) needs to be built such that it best approximates the target, not such that it best separates signal from background.

Classification and Regression Visualisation in 2D

[Figure slide: plots over the (x1, x2) plane and the corresponding surface y(x1, x2); no further text on the slide.]

Event Classification

- y(x): R^D → R: the mapping from the "feature space" (observables) to one output variable.
- PDF_B(y), PDF_S(y): normalised distributions of y = y(x) for background and signal events (i.e. the "function" that describes the shape of the distribution).
- With y = y(x) one can also write PDF_B(y(x)), PDF_S(y(x)): probability densities for background and signal.
- Now let's assume we have an unknown event from the example above for which y(x) = 0.2; reading off the distributions gives PDF_B(y(x)) = 1.5 and PDF_S(y(x)) = …
- Let f_S and f_B be the fractions of signal and background events in the sample. Then

  P(S | y(x)) = f_S · PDF_S(y(x)) / ( f_S · PDF_S(y(x)) + f_B · PDF_B(y(x)) )

  is the probability that an event with measured x = {x1, ..., xD}, which gives this y(x), is of type signal.
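A small worked example of the formula above (my own numbers except PDF_B = 1.5, which is quoted on the slide; the signal density and the sample fractions are assumptions for illustration):

```python
# Plug assumed numbers into P(S | y(x)) = f_S PDF_S / (f_S PDF_S + f_B PDF_B).
pdf_b = 1.5          # background density at y(x) = 0.2 (value from the slide)
pdf_s = 0.45         # signal density at y(x) = 0.2 (assumed value)
f_s, f_b = 0.1, 0.9  # assumed signal/background fractions in the sample

p_signal = f_s * pdf_s / (f_s * pdf_s + f_b * pdf_b)
print(f"P(S | y(x)=0.2) = {p_signal:.3f}")   # ~0.032 with these assumed numbers
```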

Event Classification

- P(Class = C | x) (or simply P(C | x)): the probability that the event class is C, given the measured observables x = {x1, ..., xD} → y(x).
- Bayes' theorem, written for the test statistic:

  P(C | y(x)) = P(y(x) | C) · P(C) / P(y(x))

  - P(C | y(x)): posterior probability
  - P(y(x)): overall probability density to observe the actual measurement y(x), i.e. the probability density distribution according to the measurements x and the given mapping function
- It's a nice "exercise" to show that this application of Bayes' theorem gives exactly the formula on the previous slide!
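The "exercise" spelled out as a short derivation (my own working, using only the definitions above: priors P(S) = f_S, P(B) = f_B and class-conditional densities p(y|S) = PDF_S(y), p(y|B) = PDF_B(y)):

```latex
% Bayes' theorem applied to the test statistic y(x), expanding the denominator
% with the law of total probability:
\begin{align}
P(S \mid y(x))
  &= \frac{p\big(y(x) \mid S\big)\, P(S)}{p\big(y(x)\big)}
   = \frac{p\big(y(x) \mid S\big)\, P(S)}
          {p\big(y(x) \mid S\big)\, P(S) + p\big(y(x) \mid B\big)\, P(B)} \\[4pt]
  &= \frac{f_S \,\mathrm{PDF}_S\big(y(x)\big)}
          {f_S \,\mathrm{PDF}_S\big(y(x)\big) + f_B \,\mathrm{PDF}_B\big(y(x)\big)},
\end{align}
```

which is exactly the formula quoted on the previous slide.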

Receiver Operating Characteristic (ROC) curve

- Signal (H1) / background (H0) discrimination with a test statistic y(x), built such that y(B) → 0 and y(S) → 1.
- Different classifiers y'(x), y''(x) give different ROC curves; which one of the two (blue) curves is the better one?
- Moving the cut on y(x) trades the two error types against each other: at one end the type-1 error is small and the type-2 error large, at the other end the type-1 error is large and the type-2 error small.
- Type-1 error: reject H0 although it is true → background contamination.
  - significance α: background selection efficiency; 1 − α: background rejection
- Type-2 error: accept H0 although it is false → loss of efficiency.
  - power 1 − β: signal selection efficiency
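A hedged sketch (mine, not from the slides) of how ROC curves are obtained by scanning the cut on y(x) for two toy classifiers, and how they can be compared at a fixed working point; the Gaussian score distributions and the 90% working point are assumptions.

```python
# ROC curves from scanning the cut on y(x): signal efficiency vs. background rejection.
import numpy as np

rng = np.random.default_rng(3)

def roc(y_sig, y_bkg, cuts):
    """Signal efficiency (power, 1 - beta) and background rejection (1 - alpha) per cut."""
    sig_eff = np.array([(y_sig > c).mean() for c in cuts])
    bkg_rej = np.array([(y_bkg <= c).mean() for c in cuts])
    return sig_eff, bkg_rej

cuts = np.linspace(0.0, 1.0, 101)

# Two toy test statistics y'(x) and y''(x); the second separates better.
eff1, rej1 = roc(rng.normal(0.6, 0.20, 10_000), rng.normal(0.4, 0.20, 10_000), cuts)
eff2, rej2 = roc(rng.normal(0.7, 0.15, 10_000), rng.normal(0.3, 0.15, 10_000), cuts)

def eff_at_rejection(eff, rej, target=0.90):
    """Best signal efficiency achievable at a given minimum background rejection."""
    ok = rej >= target
    return eff[ok].max() if ok.any() else 0.0

print("signal eff at 90% bkg rejection, y'(x) :", round(eff_at_rejection(eff1, rej1), 3))
print("signal eff at 90% bkg rejection, y''(x):", round(eff_at_rejection(eff2, rej2), 3))
```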

Event Classification

- Unfortunately, the true probability density functions are typically unknown: the Neyman-Pearson lemma doesn't really help us directly.
- What we do have: Monte Carlo simulation or, in general, a set of known (already classified) "events".
- Two different ways to use these "training" events:
  - Estimate the functional form of p(x|C) (e.g. the differential cross section folded with the detector influences), from which the likelihood ratio can be obtained, e.g. with a D-dimensional histogram, kernel density estimators, ... → generative algorithms.
  - Find a "discrimination function" y(x) and a corresponding decision boundary (i.e. a hyperplane* in the "feature space": y(x) = const) that optimally separates signal from background, e.g. a linear discriminant, neural networks, ... → discriminative algorithms.
- (*) A hyperplane in the strict sense goes through the origin; here "affine set" would be the precise term.
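A rough sketch (my own, assuming scikit-learn) contrasting the two routes on invented toy data: a generative estimate of p(x|C) via kernel density estimation turned into a likelihood ratio, versus a discriminative linear discriminant trained directly on the same events.

```python
# Generative vs. discriminative classification on toy 2-D data.
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
X_sig = rng.multivariate_normal([1.0, 0.5], np.eye(2), 3000)
X_bkg = rng.multivariate_normal([-1.0, -0.5], np.eye(2), 3000)
X = np.vstack([X_sig, X_bkg])
labels = np.concatenate([np.ones(3000), np.zeros(3000)])

# Generative: estimate p(x|S) and p(x|B) with kernel density estimators,
# then classify with the (log) likelihood ratio.
kde_sig = KernelDensity(bandwidth=0.3).fit(X_sig)
kde_bkg = KernelDensity(bandwidth=0.3).fit(X_bkg)
log_lr = kde_sig.score_samples(X) - kde_bkg.score_samples(X)
gen_pred = log_lr > 0.0

# Discriminative: learn the decision boundary y(x) = const directly.
disc_pred = LinearDiscriminantAnalysis().fit(X, labels).predict(X) == 1

print("generative (KDE likelihood ratio) accuracy   :", (gen_pred == (labels == 1)).mean())
print("discriminative (linear discriminant) accuracy:", (disc_pred == (labels == 1)).mean())
```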

MVA and Machine Learning

- Finding y(x): R^D → R given a certain type of model class: y(x) "fits" (learns) from events of known type the parameters in y(x), such that y
  - CLASSIFICATION: separates signal from background well in the training data
  - REGRESSION: fits the target function well for the training events
- Then use it for yet unknown events → predictions → supervised machine learning.
- Of course, there's no magic; we still need to (see the sketch below):
  - choose the discriminating variables
  - choose the class of models (linear or non-linear, flexible or less flexible)
  - tune the "learning parameters" → bias vs. variance trade-off
  - check the generalization properties
  - consider the trade-off between statistical and systematic uncertainties
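A minimal illustration (my own, assuming scikit-learn) of the generalization check: a deliberately flexible model is trained and its performance compared between the training sample and an independent test sample; a large gap signals overtraining. All settings here are invented.

```python
# Overtraining / generalization check with a deliberately flexible model.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(11)
X = rng.normal(size=(4000, 5))
labels = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=1.0, size=4000) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=0)

# Very flexible "learning parameters" (deep trees, many boosting steps) to provoke overtraining.
model = GradientBoostingClassifier(n_estimators=500, max_depth=6).fit(X_train, y_train)

print("accuracy on training events:", model.score(X_train, y_train))
print("accuracy on test events    :", model.score(X_test, y_test))
# A large difference between the two numbers means the model memorised the
# training data (high variance) rather than learning a generalising y(x).
```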

Summary

Multivariate algorithms
- combine all 'discriminating observables' into ONE single "MVA variable" y(x): R^D → R
  - it contains 'all' the information from the D observables
  - it allows us to place ONE final cut, corresponding to a (possibly complicated) decision boundary in D dimensions
  - it may also be used to "weight" events rather than to 'cut' them away
- y(x) is found by "training": a fit of the free parameters in the model y to 'known data'