Multivariate Methods of Data Analysis
Advanced Scientific Computing Workshop, ETH 2014
Helge Voss, Max-Planck-Institut für Kernphysik, Heidelberg

Overview
- Multivariate classification/regression algorithms (MVA): what they are and how they work
- Overview of some classifiers:
  - Multidimensional likelihood (kNN: k-nearest neighbour)
  - Projective likelihood (naïve Bayes)
  - Linear classifiers
  - Non-linear classifiers: neural networks, boosted decision trees, support vector machines
- General comments about overtraining and systematic errors

Event Classification
Unfortunately, the true probability density functions are typically unknown, so the Neyman-Pearson lemma doesn't really help us directly. What we do have is Monte Carlo simulation or, in general, a set of known (already classified) "events". There are two different ways to use these "training" events:
- Estimate the functional form of p(x|C) (e.g. the differential cross section folded with the detector influences), from which the likelihood ratio can be obtained, e.g. a D-dimensional histogram, kernel density estimators, … (generative algorithms).
- Find a "discrimination function" y(x) and a corresponding decision boundary (i.e. a hyperplane* in the "feature space": y(x) = const) that optimally separates signal from background, e.g. linear discriminants, neural networks, … (discriminative algorithms).
* A hyperplane in the strict sense goes through the origin; "affine set" would be the precise term.

K-Nearest Neighbour
Estimate the probability density P(x) in D-dimensional space. The only thing at our disposal is our "training data": "events" distributed according to P(x).
Say we want to know P(x) at a given point x. From a dataset with N events one expects to find N·∫_V P(x)dx events in a volume V around x (figure: training events in the (x1, x2) plane, with a window of size h around x).
- Count the number of training events in the volume V(h): K events.
- This gives an estimate of the average P(x) in the volume V: ∫_V P(x)dx ≈ K/N.
This is a kernel density estimator of the probability density.
Classification: determine PDF_S(x) and PDF_B(x) and use their likelihood ratio as the classifier!
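A minimal sketch of this K/(N·V) estimate and the resulting likelihood-ratio classifier, assuming NumPy/SciPy are available (function and sample names are illustrative, not from the slides):

```python
import math

import numpy as np
from scipy.spatial import cKDTree


def knn_density(train, query, k=20):
    """Estimate P(x) at each query point as K / (N * V), where V is the volume
    of the D-dimensional ball reaching out to the k-th nearest training event."""
    n_events, dim = train.shape
    tree = cKDTree(train)
    dist, _ = tree.query(query, k=k)        # distances to the k nearest training events
    r_k = dist[:, -1]                       # radius that encloses exactly k events
    volume = math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * r_k ** dim
    return k / (n_events * volume)


def knn_likelihood_ratio(signal, background, query, k=20):
    """Classifier output y(x) = P_S(x) / (P_S(x) + P_B(x))."""
    p_s = knn_density(signal, query, k)
    p_b = knn_density(background, query, k)
    return p_s / (p_s + p_b)
```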

K-Nearest Neighbour (continued)
Same setup as before: estimate P(x) at a point x by counting the K training events in a volume V(h) around x, giving ∫_V P(x)dx ≈ K/N.
Regression: if each event at (x1, x2) carries a "function value" f(x1, x2) (e.g. the energy of the incident particle), estimate f(x) as the average function value over the K nearest neighbours: f̂(x) ≈ (1/K) Σ_{i ∈ kNN} f(x_i).
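A corresponding regression sketch, again assuming NumPy/SciPy (names are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree


def knn_regression(train_x, train_f, query, k=20):
    """Estimate f(x) as the average 'function value' (e.g. incident-particle
    energy) of the k nearest training events."""
    tree = cKDTree(train_x)
    _, idx = tree.query(query, k=k)        # indices of the k nearest neighbours
    return train_f[idx].mean(axis=1)       # average function value per query point
```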

K-Nearest Neighbour (continued)
Now determine K from the "training data" with signal and background mixed together:
- kNN (k-nearest neighbours): the classifier output is the relative number of events of the various classes amongst the k nearest neighbours.
- Kernel density estimator: replace the sharp "window" by a "smooth" kernel function, i.e. weight events by their distance (e.g. via a Gaussian).
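A minimal sketch of both variants (plain relative count, and Gaussian distance weighting). The bandwidth argument h and all names are illustrative assumptions, not from the slides:

```python
import numpy as np
from scipy.spatial import cKDTree


def knn_classifier_output(train_x, train_is_signal, query, k=20, h=None):
    """Relative number of signal events amongst the k nearest neighbours of the
    mixed signal+background training sample; if h is set, neighbours are instead
    weighted by a Gaussian of their distance (the 'smooth kernel' variant)."""
    tree = cKDTree(train_x)
    dist, idx = tree.query(query, k=k)               # shapes (n_query, k)
    labels = train_is_signal[idx].astype(float)      # 1 for signal, 0 for background
    if h is None:
        return labels.mean(axis=1)                   # plain relative count
    weights = np.exp(-0.5 * (dist / h) ** 2)         # Gaussian distance weighting
    return (weights * labels).sum(axis=1) / weights.sum(axis=1)
```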

Kernel Density Estimator
A general probability density estimator using a kernel K with bandwidth h (the "size" of the kernel, a "smoothing parameter"): P(x) ≈ 1/(N·h^D) Σ_i K((x − x_i)/h).
- The chosen size of the smoothing parameter is more important than the kernel function itself:
  - h too small: overtraining
  - h too large: not sensitive to features in P(x)
- Which metric for the kernel (window)?
  - Normalise all variables to the same range.
  - Include correlations? Mahalanobis metric: x·x → xᵀV⁻¹x (Christopher M. Bishop).
- A drawback of kernel density estimators: evaluation for any test event involves ALL training data, which is typically very time consuming.
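A minimal sketch of a Gaussian kernel density estimator (NumPy assumed; names are illustrative). For the Mahalanobis metric one would replace the squared Euclidean distance by (x − x_i)ᵀV⁻¹(x − x_i):

```python
import numpy as np


def gaussian_kde(train, query, h=0.1):
    """P(x) ~ 1/(N h^D) * sum_i K((x - x_i)/h) with a Gaussian kernel K.
    Every evaluation runs over ALL training events, which is exactly the
    drawback mentioned above."""
    n_events, dim = train.shape
    diff = query[:, None, :] - train[None, :, :]     # (n_query, n_train, D)
    dist2 = (diff ** 2).sum(axis=-1)                 # squared Euclidean distances
    kernel = np.exp(-0.5 * dist2 / h ** 2) / (2.0 * np.pi * h ** 2) ** (dim / 2)
    return kernel.mean(axis=1)                       # average over the training events
```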

“Curse of Dimensionality” (Bellman, R. (1961), Adaptive Control Processes: A Guided Tour, Princeton University Press)
We all know: filling a D-dimensional histogram to get a mapping of the PDF is typically unfeasible due to the lack of Monte Carlo events.
Shortcoming of nearest-neighbour strategies: in higher-dimensional cases the K events are often no longer in a small "vicinity" of the space point. Consider a total phase-space volume V = 1^D (a unit hypercube); a sub-cube containing a given fraction of the volume has edge length fraction^(1/D). In 10 dimensions, capturing 1% of the phase space already requires 63% of the range in each variable, which is not "local" anymore.
This is why all the alternative classification/regression techniques were developed.
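A quick check of that number (a throwaway snippet, not from the slides):

```python
# Edge length of a sub-cube that contains a fraction f of a D-dimensional unit cube:
# edge = f**(1/D).  For f = 1% this is already ~63% of each variable's range at D = 10.
for dim in (1, 2, 5, 10):
    print(f"D = {dim:2d}: edge = {0.01 ** (1.0 / dim):.2f}")
# D =  1: edge = 0.01
# D =  2: edge = 0.10
# D =  5: edge = 0.40
# D = 10: edge = 0.63
```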

Naïve Bayesian Classifier (projective Likelihood Classifier)
The multivariate likelihood (k-nearest neighbour) estimates the full D-dimensional joint probability density. If the correlations between variables are weak, one can instead use the product of marginal PDFs (1-dimensional "histograms") of the discriminating variables: L_C(event) = Π_k p_{C,k}(x_k) for each class C (signal, background), and the likelihood ratio y(event) = L_S / (L_S + L_B) is used as the classifier.
- One of the first and still very popular MVA algorithms in HEP.
- No hard cuts on individual variables; the PDE introduces fuzzy logic, allowing for some "fuzziness": one very signal-like variable may counterweigh another, less signal-like variable.
- Optimal method if the correlations are zero (Neyman-Pearson lemma).
- Try to "eliminate" correlations, e.g. by linear de-correlation.

Naïve Bayesian Classifier (projective Likelihood Classifier)
Where to get the 1-D PDFs?
- Simple histograms
- Smoothing (e.g. spline or kernel function)
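A minimal sketch of the projective likelihood built from simple 1-D histograms (NumPy assumed; names are illustrative, and the histograms could of course be replaced by smoothed/spline PDFs):

```python
import numpy as np


def fit_marginal_pdfs(sample, n_bins=50):
    """One normalised 1-D histogram ('marginal PDF') per input variable."""
    return [np.histogram(sample[:, j], bins=n_bins, density=True)
            for j in range(sample.shape[1])]


def eval_likelihood(pdfs, x):
    """L(x) = product over variables of the 1-D PDF values at x_j."""
    likelihood = 1.0
    for (hist, edges), value in zip(pdfs, x):
        j = np.clip(np.searchsorted(edges, value) - 1, 0, len(hist) - 1)
        likelihood *= max(hist[j], 1e-12)            # avoid exact zeros
    return likelihood


def projective_likelihood(pdfs_signal, pdfs_background, x):
    """y(x) = L_S / (L_S + L_B), the projective-likelihood classifier output."""
    l_s = eval_likelihood(pdfs_signal, x)
    l_b = eval_likelihood(pdfs_background, x)
    return l_s / (l_s + l_b)
```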

What if there are correlations?
Typically correlations are present: C_ij = cov[x_i, x_j] = E[x_i x_j] − E[x_i]E[x_j] ≠ 0 (i ≠ j).
Pre-processing: choose a set of linearly transformed input variables for which C_ij = 0 (i ≠ j).

De-Correlation
- Find a variable transformation that diagonalises the covariance matrix.
- Determine the square root C′ of the covariance matrix C, i.e. C = C′C′; compute C′ by diagonalising C: C′ = S √Λ Sᵀ with SᵀCS = Λ (S: matrix of eigenvectors, Λ: diagonal matrix of eigenvalues).
- Transform from the original variables x into the de-correlated variable space x′ by x′ = C′⁻¹x.
Attention: this eliminates only linear correlations!!
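A minimal sketch of this square-root decorrelation via eigen-decomposition (NumPy assumed; it assumes a non-singular covariance matrix, and the names are illustrative):

```python
import numpy as np


def sqrt_decorrelate(sample):
    """Transform x -> C'^{-1} x with C' the square root of the covariance
    matrix C (C = C'C'), so the transformed variables are linearly uncorrelated."""
    cov = np.cov(sample, rowvar=False)                        # covariance matrix C
    eigval, eigvec = np.linalg.eigh(cov)                      # C = S Lambda S^T
    c_sqrt_inv = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T  # C'^{-1} = S Lambda^{-1/2} S^T
    centred = sample - sample.mean(axis=0)
    return centred @ c_sqrt_inv.T                             # decorrelated variables x'


# Sanity check: the covariance of the transformed sample should be (close to) the identity:
# np.allclose(np.cov(sqrt_decorrelate(sample), rowvar=False), np.eye(sample.shape[1]))
```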

Decorrelation at Work
Example: linearly correlated Gaussians, where decorrelation works to 100%. The 1-D likelihood on the decorrelated sample gives the best possible performance. Compare also the effect on the MVA output variable!
(Plots: the correlated variables, and the same variables after decorrelation.)
Watch out! Things might look very different for non-linear correlations!

Limitations of the Decorrelation
In cases with non-Gaussian distributions and/or non-linear correlations, decorrelation needs to be treated with care.
How does linear decorrelation affect cases where the correlations between signal and background differ?
(Plots: original correlations, signal and background.)

Limitations of the Decorrelation (continued)
In cases with non-Gaussian distributions and/or non-linear correlations, decorrelation needs to be treated with care.
How does linear decorrelation affect cases where the correlations between signal and background differ?
(Plots: after SQRT decorrelation, signal and background.)

Limitations of the Decorrelation (continued)
In cases with non-Gaussian distributions and/or non-linear correlations, decorrelation needs to be treated with care.
How does linear decorrelation affect strongly non-linear cases?
(Plots: original correlations, signal and background.)

Limitations of the Decorrelation (continued)
In cases with non-Gaussian distributions and/or non-linear correlations, decorrelation needs to be treated with care.
How does linear decorrelation affect strongly non-linear cases?
(Plots: after SQRT decorrelation, signal and background.)
Watch out before you use decorrelation "blindly"!! Perhaps "decorrelate" only a subspace!

Summary
I hope you are all convinced that multivariate algorithms are nice and powerful classification techniques:
- They do not use hard selection criteria (cuts) on each individual observable; they look at all observables "together", e.g. by combining them into one variable.
- Multidimensional likelihood: PDF in D dimensions.
- Projective likelihood (naïve Bayes): PDF in D times 1 dimension; be careful about correlations.