1
Multivariate Analysis with TMVA: Basic Concepts and New Features
Helge Voss(*) (MPI–K, Heidelberg), LHCb MVA Workshop, CERN, July 10, 2009. (*) On behalf of the author team: A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E. v. Toerne, H. Voss, and the contributors (see acknowledgments on page xx). On the web: (home), (tutorial)
2
Event Classification: How can we decide?
Suppose a data sample with two types of events, carrying the class labels Signal and Background (we restrict ourselves here to the two-class case; many classifiers can in principle be extended to several classes, otherwise analyses can be staged). We have discriminating variables x1, x2, … How do we set the decision boundary to select events of type S? Rectangular cuts? A linear boundary? A nonlinear one? And once we have decided on a class of boundaries, how do we find the "optimal" one?
3
Regression In High Energy Physics
Most HEP event reconstruction requires some sort of "function estimate" for which we do not have an analytical expression: energy deposit in the calorimeter (shower profile calibration), entry location of the particle in the calorimeter, distance between two overlapping photons in the calorimeter, … you can certainly think of more. Maybe you could even imagine some final physics parameter that needs to be fitted to your event data, where you would rather use a non-analytic fit function from Monte Carlo events than an analytic parametrisation of the theory (e.g. the re-weighting method used in the LEP W-mass analysis); somewhat wild guessing, might not be reasonable to do. Maybe your application can be successfully learned by a machine using Monte Carlo events?
4
Regression: how to estimate a "functional behaviour" from a given set of "known measurements"? Assume for example D variables that somehow characterize the shower in your calorimeter. From Monte Carlo or testbeam measurements you have a data sample in which each event provides all D variables x and the corresponding particle energy. The particle energy as a function of the D observables x is a surface in the (D+1)-dimensional space spanned by the D observables plus the energy. Constant? Linear? Non-linear? Seems trivial?? Maybe… the human eye, and the brain behind it, have very good pattern-recognition capabilities! But what if you have more variables?
5
Event Classification: Why MVA ?
Assume 4 uncorrelated Gaussians: what do you think, is this event more "likely" signal or background?
6
Event Classification in a D-dimensional "feature space"
Each event, whether Signal or Background, has D measured variables, spanning a D-dimensional "feature space". Find a mapping y(x): R^D → R from the D-dimensional input/observable/"feature" space to a one-dimensional output of class labels. Most general form: y = y(x) with x = {x1, …, xD} the input variables. One can then histogram the resulting y(x) values. Who sees what this would look like for the regression problem?
7
Event Classification in a D-dimensional "feature space"
Each event, whether Signal or Background, has D measured variables. Find a mapping y(x): R^D → R from the D-dimensional input/observable/"feature" space to a one-dimensional output of class labels, with e.g. y(B) → 0 and y(S) → 1. y(x) is a "test statistic" in the D-dimensional space of input variables. The distributions of y(x), PDF_S(y) and PDF_B(y), are used to set the selection cut: y > cut: signal; y = cut: decision boundary; y < cut: background. The cut on y(x) determines efficiency and purity; y(x) = const is the surface defining the decision boundary, and the overlap of PDF_S(y) and PDF_B(y) determines the separation power and purity.
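For reference (added here, not on the original slide), the chosen cut value y_cut translates into efficiencies and purity via the one-dimensional output PDFs:

```latex
% Signal/background efficiency of a cut on the test statistic y(x), and the resulting
% signal purity for N_S expected signal and N_B expected background events.
\varepsilon_{S/B}(y_{\rm cut}) = \int_{y_{\rm cut}}^{\infty} {\rm PDF}_{S/B}(y)\, dy,
\qquad
{\rm purity}(y_{\rm cut}) = \frac{\varepsilon_S N_S}{\varepsilon_S N_S + \varepsilon_B N_B}
```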
8
MVA and Machine Learning
The previous slide was basically the idea of "Multivariate Analysis" (MVA). Remark: what about "standard cuts"? Finding y(x): R^D → R ("basically" the same for regression and classification) for a given class of models y(x), in an automatic way, using "known" or "previously solved" events, i.e. learning from known "patterns" such that the resulting y(x) has good generalization properties when applied to "unknown" events: that is what the "machine" is supposed to be doing: supervised machine learning. Of course, there is no magic; we still need to: choose the discriminating variables; choose the class of models (linear, non-linear, flexible or less flexible); tune the "learning parameters" (bias vs. variance trade-off); check the generalization properties; consider the trade-off between statistical and systematic uncertainties.
9
Overview/Outline. Idea: rather than just implementing new MVA techniques and making them available in ROOT (as e.g. TMultiLayerPerceptron does), provide one common platform/interface for all MVA classification and regression algorithms: easy to use and to switch between classifiers; (common) data pre-processing capabilities; train and test all classifiers on the same data sample and evaluate them consistently; common analysis, application and evaluation framework (ROOT macros, C++ executables, python scripts, …); application of the trained classifier/regressor using libTMVA.so and the corresponding "weight files", or stand-alone C++ code (the latter is currently disabled in TMVA 4.01). Outline: the TMVA project; survey of the available classifiers/regressors; toy examples; general considerations and systematics.
10
TMVA Development and Distribution
TMVA is a SourceForge (SF) package for world-wide access: home page, SF project page, SVN browser, mailing list, tutorial TWiki. Active project: currently 6 core developers and 42 registered contributors at SF; >4864 downloads since March 2006 (not counting CVS/SVN checkouts and ROOT users); current version released 29 Jun 2009; we try to be fast on user requests, bug fixes etc. Integrated and distributed with ROOT since ROOT v5.11 (newest version in the ROOT production release 5.23).
11
New (and future) TMVA Developments
MV regression. Code re-design allowing for: boosting of any classifier (not only decision trees); combination of any pre-processing algorithms; classifier training in multiple regions; multi-class classifiers (will slowly creep into some classifiers…); cross-validation. New MVA technique: PDE-Foam (classification + regression). Data preprocessing: Gauss transformation, arbitrary combinations of preprocessing steps. Numerous small add-ons etc.
12
The TMVA Classifiers (Regressors)
Currently implemented classifiers: rectangular cut optimisation; projective (1D) likelihood estimator; multidimensional likelihood estimators (PDE range search, PDE-Foam, k-Nearest Neighbour; also as regressors); linear discriminant analysis (Fisher, H-Matrix, general linear; the latter also as regressor); Function Discriminant (also as regressor); artificial neural networks (3 multilayer perceptron implementations; MLP also as regressor with multi-dimensional target); boosted/bagged/randomised decision trees (also as regressor); Rule Ensemble Fitting; Support Vector Machine; plugin mechanism for any other MVA classifier (e.g. NeuroBayes). Currently implemented data preprocessing: linear decorrelation, principal component analysis, Gauss transformation.
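A minimal sketch (not from the slides) of how a few of these classifiers might be booked and trained with the TMVA Factory; file names, tree names, variable names and option strings are illustrative only, and the exact interface may differ between TMVA/ROOT versions:

```cpp
// Minimal TMVA classification training sketch (illustrative; names and options are assumptions).
#include "TFile.h"
#include "TTree.h"
#include "TMVA/Factory.h"
#include "TMVA/Types.h"

void train_tmva() {
   TFile* input  = TFile::Open("toy_sample.root");     // hypothetical input file
   TTree* sig    = (TTree*)input->Get("TreeS");
   TTree* bkg    = (TTree*)input->Get("TreeB");
   TFile* output = TFile::Open("TMVA.root", "RECREATE");

   TMVA::Factory factory("TMVAClassification", output, "!V:!Silent");

   factory.AddVariable("var1", 'F');                    // discriminating variables
   factory.AddVariable("var2", 'F');
   factory.AddSignalTree(sig, 1.0);                     // global event weights per tree
   factory.AddBackgroundTree(bkg, 1.0);
   factory.PrepareTrainingAndTestTree("", "SplitMode=Random:NormMode=NumEvents");

   // Book a few of the classifiers listed above (option strings abbreviated).
   factory.BookMethod(TMVA::Types::kLikelihood, "Likelihood", "");
   factory.BookMethod(TMVA::Types::kFisher,     "Fisher",     "");
   factory.BookMethod(TMVA::Types::kMLP,        "MLP",        "HiddenLayers=N+5");
   factory.BookMethod(TMVA::Types::kBDT,        "BDT",        "NTrees=400:BoostType=AdaBoost");

   factory.TrainAllMethods();                           // train, test and evaluate all of them
   factory.TestAllMethods();
   factory.EvaluateAllMethods();
   output->Close();
}
```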
13
Example for Data Preprocessing: Decorrelation
Commonly realised for all classifiers in TMVA (centrally in the DataSet class). Removal of linear correlations by rotating the input variables, either using the "square root" of the correlation matrix or using Principal Component Analysis. Note that decorrelation is only complete if the correlations are linear and the input variables are Gaussian distributed. (Plots: original, SQRT decorrelation, PCA decorrelation.)
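Written out (added here for reference; the slide shows this only graphically), the "square-root" decorrelation replaces the input vector x by

```latex
% C is the covariance (or correlation) matrix of the inputs, diagonalised as
% C = S D S^T with rotation matrix S and diagonal eigenvalue matrix D.
x' = C^{-1/2}\, x, \qquad C^{-1/2} = S\, D^{-1/2}\, S^{T}
```

so that the transformed variables have unit covariance; PCA instead uses only the rotation onto the eigenvectors of C.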
14
"Gaussianisation": improve the decorrelation by pre-Gaussianisation of the variables. First: transform each variable k to achieve a uniform (flat) distribution (the transformed variable is the cumulative PDF of variable k evaluated at the measured value). The integral can be solved in an unbinned way by event counting, or by creating non-parametric PDFs (see the likelihood section later). Second: make the distribution Gaussian via the inverse error function. Third: decorrelate (and "iterate" this procedure).
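Written out (reconstructed here from the standard procedure; the slide shows the formulas only as images):

```latex
% Step 1: flatten variable x_k using its cumulative distribution.
% Step 2: map the flat variable to a standard Gaussian with the inverse error function.
x_k^{\rm flat} = \int_{-\infty}^{x_k} \hat{p}_k(x')\, dx', \qquad
x_k^{\rm Gauss} = \sqrt{2}\; {\rm erf}^{-1}\!\left( 2\, x_k^{\rm flat} - 1 \right)
```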
15
"Gaussianisation" (continued): plots of the original and the signal- and background-Gaussianised distributions. We cannot simultaneously Gaussianise both signal and background?!
16
Pre-processing: the decorrelation/Gauss transform can be determined from either signal + background together, signal only, or background only. For all but the "Likelihood" methods (1D or multi-dimensional) we need to decide on one of those (TMVA default: signal + background). For "Likelihood" methods, where we test the "Signal" hypothesis against the "Background" hypothesis, BOTH are used.
17
Rectangular Cut Optimisation
Simplest method: individual cuts on each variable cut out a single rectangular volume from the variable space. Technical challenge: how to find the optimal cuts? MINUIT fails due to the non-unique solution space; TMVA uses Monte Carlo sampling, a Genetic Algorithm, or Simulated Annealing. Huge speed improvement of the volume search by sorting events in a binary tree. Cuts usually benefit from prior decorrelation of the cut variables.
18
Projective (1D) Likelihood Estimator (PDE Approach)
Probability density estimators for each input variable, combined into a likelihood estimator (ignoring correlations). Optimal approach if there are zero correlations (or after linear decorrelation); otherwise: significant performance loss. Technical challenge: how to estimate the PDF shapes. 3 ways: parametric fitting with a function (difficult to automate for arbitrary PDFs); event counting (automatic and unbiased, but suboptimal); nonparametric fitting (easy to automate, but can create artefacts/suppress information). TMVA uses binned shape interpolation with spline functions, or unbinned adaptive Gaussian kernel density estimation, and performs automatic validation of the goodness-of-fit.
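The resulting test statistic (standard form, added here for completeness) is the likelihood ratio built from the per-variable PDFs:

```latex
% Projective likelihood classifier: product of 1D PDFs per class, combined into a ratio.
y_{\cal L}(x) = \frac{{\cal L}_S(x)}{{\cal L}_S(x) + {\cal L}_B(x)}, \qquad
{\cal L}_{S(B)}(x) = \prod_{k=1}^{D} p^{S(B)}_{k}(x_k)
```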
19
Multidimensional PDE Approach
Use a single PDF per event class (signal, background) which spans all Nvar dimensions. Several implementations of the same idea. PDE Range-Search: count the number of signal and background events in the "vicinity" of the test event, where a preset or adaptive volume defines the "vicinity" (Carli-Koblitz, NIM A501, 576 (2003)); improve simple event counting by using kernel functions to penalise large distances from the test event; increase the counting speed with a binary search tree. Genuine k-NN algorithm (author R. Ospanov): intrinsically adaptive approach; very fast search with kd-tree event sorting. PDE-Foam: self-adaptive phase-space binning that divides the space into hypercubes used as "volumes". Regression: the function value is taken as the distance-weighted average over the "volume".
20
Fisher’s Linear Discriminant Analysis (LDA)
Well known, simple and elegant classifier: LDA determines the axis in the input-variable hyperspace such that a projection of the events onto this axis pushes signal and background as far away from each other as possible. The classifier response couldn't be simpler: a linear combination of the inputs with "Fisher coefficients". Function discriminant analysis (FDA): fit any user-defined function of the input variables, requiring that signal events return 1 and background events 0; parameter fitting via Genetic Algorithm, MINUIT, MC and combinations thereof. Easy reproduction of the Fisher result, but nonlinearities can be added; a very transparent discriminator.
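For reference (the formula appears on the slide only as a figure), the standard Fisher response is linear in the inputs:

```latex
% Fisher discriminant: linear combination of the inputs with "Fisher coefficients" F_k,
% built from the within-class covariance matrix W and the class means.
y_{\rm Fi}(x) = F_0 + \sum_{k=1}^{D} F_k\, x_k, \qquad
F_k \propto \sum_{l} W^{-1}_{kl} \left( \bar{x}^{\,S}_l - \bar{x}^{\,B}_l \right)
```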
21
Nonlinear Analysis: Artificial Neural Networks
Achieve a nonlinear classifier response by "activating" the nodes with a nonlinear activation function (feed-forward multilayer perceptron: Nvar discriminating input variables in the input layer, k hidden layers, one output layer with 2 output classes, signal and background). Three different implementations in TMVA (all are multilayer perceptrons): TMlpANN, an interface to ROOT's MLP implementation; MLP, TMVA's own MLP implementation for increased speed and flexibility; CFMlpANN, ALEPH's Higgs-search ANN, translated from FORTRAN. Regression: uses the real-valued output of ONE node.
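For reference (the activation formula is shown on the slide only graphically), a node j in layer ℓ of such a perceptron typically computes

```latex
% Feed-forward MLP: weighted sum of the previous layer's outputs passed through a
% sigmoid-like activation function A.
y_j^{(\ell)} = A\!\left( \sum_i w_{ij}^{(\ell)}\, y_i^{(\ell-1)} + \theta_j^{(\ell)} \right),
\qquad A(t) = \frac{1}{1 + e^{-t}}
```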
22
Boosted Decision Trees (BDT)
DT: sequential application of cuts splits the data into nodes, where the final nodes (leaves) classify an event as signal or background. BDT: combine a forest of DTs, with differently weighted events in each tree (the trees can also be weighted). E.g. "AdaBoost": incorrectly classified events receive a larger weight in the next DT; GradientBoost; "Bagging": random Poisson event weights, i.e. resampling with replacement; "Randomised Forest": like bagging plus the use of random sub-sets of the variables. Boosting/bagging/randomising create a set of "basis functions"; the final classifier is a linear combination of these, which improves stability. Bottom-up "pruning" of a decision tree: remove statistically insignificant nodes to reduce tree overtraining.
23
Boosted Decision Trees (BDT)
DT: sequential application of cuts splits the data into nodes, where the final nodes (leaves) classify an event as signal or background. (Plots: decision tree before and after pruning.) Bottom-up "pruning" of a decision tree: remove statistically insignificant nodes to reduce tree overtraining. Randomised Forests don't need pruning and are VERY ROBUST. Some people even claim that "boosted" decision trees never need to be pruned, but according to our experience this is not always true. Regression: the function value is taken as the average of the training events in the leaf node.
24
Learning with Rule Ensembles
Following the RuleFit approach by Friedman-Popescu (Friedman-Popescu, Tech. Rep., Statistics Dept., Stanford U., 2003). The model is a linear combination of rules, where a rule is a sequence of cuts: the RuleFit classifier is a sum of rules (cut sequences, r_m = 1 if all cuts are satisfied, = 0 otherwise) plus a linear Fisher term in the normalised discriminating event variables. The problem to solve: create the rule ensemble using a forest of decision trees, then fit the coefficients a_m, b_k by gradient direct regularization, minimising the risk (Friedman et al.). Pruning removes topologically equal rules (same variables in the cut sequence). (Figure: one of the elementary cellular automaton rules (Wolfram 1983, 2002); it specifies the next color of a cell depending on its color and its immediate neighbors, with the rule outcomes encoded in the binary representation of 30.)
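Written out (reconstructed here from the description above; the slide shows it only as an annotated image), the RuleFit model is

```latex
% Sum of rules r_m(x) plus a linear Fisher-like term in the normalised inputs x_k.
y_{\rm RF}(x) = a_0 + \sum_{m=1}^{M_R} a_m\, r_m(x) + \sum_{k=1}^{D} b_k\, x_k
```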
25
Boosting: starting from the training sample, train a classifier C(0)(x); re-weight the events to obtain a weighted sample and train C(1)(x); re-weight again and train C(2)(x), C(3)(x), and so on up to C(m)(x).
26
Adaptive Boosting (AdaBoost)
AdaBoost re-weights the events misclassified by the previous classifier, and it also weights the individual classifiers using their error rate (same iterative scheme as on the previous slide: training sample → C(0)(x) → re-weight → weighted sample → C(1)(x) → … → C(m)(x)).
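The two formulas referred to above appear on the slide only as images; in the standard AdaBoost notation they read (reconstructed here, with err the misclassification rate of the previous classifier):

```latex
% Event re-weighting for misclassified events, and the weight of classifier i in the
% final combination y(x) = sum_i alpha_i C^{(i)}(x).
w \;\rightarrow\; w \cdot \frac{1 - {\rm err}}{{\rm err}},
\qquad
\alpha_i = \ln \frac{1 - {\rm err}_i}{{\rm err}_i}
```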
27
AdaBoost: A simple demonstration
The example (somewhat artificial, but nice for demonstration): a data sample with three "bumps" and a weak classifier, i.e. one single simple "cut", a decision tree stump (var(i) > x → S, var(i) <= x → B). Two reasonable cuts: a) Var0 > 0.5: εsignal = 66%, εbkg ≈ 0%, 16.5% of the events misclassified in total; or b) Var0 < -0.5: εsignal = 33%, εbkg ≈ 0%, 33% misclassified in total. The training of a single decision tree stump will find "cut a)".
28
AdaBoost: A simple demonstration
The first "tree", choosing cut a), will give an error fraction err = 0.165. Before building the next "tree", the wrongly classified training events are weighted by (1 - err)/err ≈ 5; the next "tree" hence sees an essentially re-weighted data sample and will choose "cut b)": Var0 < -0.5. The combined classifier Tree1 + Tree2, i.e. the (weighted) average of the response of both trees to a test event, is able to separate signal from background as well as one would expect from the most powerful classifier.
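Just to make the re-weighting factor explicit (simple arithmetic, not spelled out on the slide):

```latex
\frac{1 - {\rm err}}{{\rm err}} = \frac{1 - 0.165}{0.165} \approx 5.06
```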
29
AdaBoost: A simple demonstration
(Plots: classifier response with only 1 tree "stump" versus 2 tree "stumps" combined with AdaBoost.)
30
Boosting Fisher Discriminant
y(x) is a linear combination of the input variables, and a linear combination of Fisher discriminants is again a Fisher discriminant; hence "simple boosting doesn't help". However: apply a CUT on the "Fisher discriminant" in each step (or any other non-linear transformation), and then it works!
31
Support Vector Machine
There are methods to create linear decision boundaries using only measures of distances; this leads to a quadratic optimisation process. Using variable transformations, i.e. moving into a higher-dimensional space (e.g. using var1*var1 in a Fisher discriminant), in principle allows non-linear problems to be separated with linear decision boundaries. The decision boundary in the end is defined only by the training events that are closest to the boundary: the support vectors of the Support Vector Machine.
32
Support Vector Machines
Find the hyperplane that best separates signal from background. Best separation: maximum distance (margin) between the events closest to the hyperplane (the support vectors) and the hyperplane; a linear decision boundary. If the data are non-separable, add a misclassification cost term C·Σ_i ξ_i to the minimisation function. The solution of largest margin depends only on inner products of the support vectors (distances): a quadratic minimisation problem.
33
Support Vector Machines
As on the previous slide, find the hyperplane that best separates signal from background; the solution depends only on inner products of the support vectors (quadratic minimisation problem). Non-linear cases: transform the variables into a higher-dimensional feature space where again a linear boundary (hyperplane) can separate the data. The explicit transformation does not need to be specified; only the "scalar product" (inner product) is needed, x·x → Φ(x)·Φ(x). Certain kernel functions can be interpreted as scalar products between transformed vectors in the higher-dimensional feature space, e.g. Gaussian, polynomial, sigmoid. Choose a kernel and fit the hyperplane using the linear techniques developed above. The kernel size parameter typically needs careful tuning (overtraining!).
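As an example of the kernel substitution (standard formulas, added here for reference):

```latex
% The inner product in the transformed space is replaced by a kernel function,
% e.g. the Gaussian (RBF) kernel whose width sigma is the parameter that needs tuning.
x \cdot y \;\rightarrow\; \Phi(x) \cdot \Phi(y) = K(x, y), \qquad
K_{\rm Gauss}(x, y) = \exp\!\left( - \frac{\lVert x - y \rVert^2}{2 \sigma^2} \right)
```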
34
Data Preparation and Preprocessing
Data input format: ROOT TTree or ASCII. Supports selection of any subset, combination or function of the available variables; application of pre-selection cuts (possibly independent for signal and background); global event weights for the signal or background input files; use of any input variable as an individual event weight; various methods for splitting into training and test samples: block-wise, randomly, periodically (e.g. 3 testing events, 2 training events, 3 testing events, 2 training events, …), user-defined, or separate training and testing trees. Preprocessing of input variables (e.g. decorrelation).
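A sketch of how these data-preparation options might look in code, continuing the earlier Factory example (the cut, weight-branch and variable names are placeholders, and the exact option strings may differ by version):

```cpp
// Data-preparation options (illustrative sketch; factory, sigTree, bkgTree as in the
// earlier training example).
#include "TTree.h"
#include "TMVA/Factory.h"

void prepare_data(TMVA::Factory& factory, TTree* sigTree, TTree* bkgTree) {
   factory.AddVariable("var1", 'F');
   factory.AddVariable("log(var2)", 'F');             // any function of available variables
   factory.AddSignalTree(sigTree, 1.0);                // global event weight per input tree
   factory.AddBackgroundTree(bkgTree, 0.5);
   factory.SetSignalWeightExpression("evtWeight");     // individual event weights from a branch
   factory.SetBackgroundWeightExpression("evtWeight");
   // pre-selection cut (an overload allows separate signal/background cuts) plus split options
   factory.PrepareTrainingAndTestTree("pt > 20",
        "nTrain_Signal=5000:nTrain_Background=5000:SplitMode=Random:!V");
}
```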
35
Code Flow for Training and Application Phases
Scripts can be ROOT scripts, C++ executables or python scripts (via PyROOT); see the TMVA tutorial.
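For the application phase, a minimal sketch of TMVA::Reader usage (the weight-file path follows the usual default naming, and the variable names are placeholders):

```cpp
// Applying a trained classifier with TMVA::Reader (illustrative sketch).
#include "TMVA/Reader.h"

double apply_bdt(float v1, float v2) {
   static float var1, var2;                            // must match the training variables
   static TMVA::Reader reader("!Color:!Silent");
   static bool booked = false;
   if (!booked) {
      reader.AddVariable("var1", &var1);
      reader.AddVariable("var2", &var2);
      reader.BookMVA("BDT", "weights/TMVAClassification_BDT.weights.xml"); // from training
      booked = true;
   }
   var1 = v1;                                          // fill the variables for this event
   var2 = v2;
   return reader.EvaluateMVA("BDT");                   // classifier output y(x)
}
```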
36
MVA Evaluation Framework
TMVA is not only a collection of classifiers, but an MVA framework. After training, TMVA provides ROOT evaluation scripts (through a GUI): plots of all signal (S) and background (B) input variables with and without pre-processing; correlation scatter plots and linear correlation coefficients for S & B; classifier outputs (S & B) for test and training samples (to spot overtraining); classifier Rarity distributions; classifier significance with optimal cuts; B rejection versus S efficiency. Classifier-specific plots: likelihood reference distributions; classifier PDFs (for probability output and Rarity); network architecture, weights and convergence; Rule Fitting analysis plots; visualisation of decision trees.
37
Evaluating the Classifier Training: projective likelihood PDFs, MLP training, BDTs, … (average number of nodes before/after pruning: 4193 / 968).
38
Evaluating the Classifier Training: classifier output distributions for test and training samples.
39
Evaluating the Classifier Training: optimal cut for each classifier. Using the TMVA graphical output one can determine the optimal cut on a classifier output (the working point in terms of statistical significance).
40
Evaluating the Classifier Training: background rejection versus signal efficiency, the best plot to compare classifier performance. If the background in data is non-uniform, there may be a problem in the training sample. An elegant variable is the Rarity: it transforms the background to a uniform distribution, so the height of the signal peak is a direct measure of classifier performance.
41
Evaluating the Classifiers (taken from TMVA output…)
Input variable ranking: how discriminating is a variable? (Top variable is best ranked.)

--- Fisher : Ranking result (top variable is best ranked)
--- Fisher : Rank : Variable : Discr. power
--- Fisher :    1 : var      : 2.175e-01
--- Fisher :    2 : var      : 1.718e-01
--- Fisher :    3 : var      : 9.549e-02
--- Fisher :    4 : var      : 2.841e-02

Classifier correlation and overlap: do the classifiers select the same events as signal and background? If not, there is something to gain!

--- Factory : Inter-MVA overlap matrix (signal):
--- Factory :              Likelihood   Fisher
--- Factory : Likelihood:
--- Factory : Fisher:
42
Evaluating the Classifiers (taken from TMVA output…)
Evaluation results ranked by best signal efficiency and purity (area):

MVA             Signal efficiency at bkg eff. (error):          | Sepa-    Signifi-
Methods:        @B=          @B=          @B=          Area     | ration:  cance:
Fisher        : 0.268(03)    (03)         (02)                  |
MLP           : 0.266(03)    (03)         (02)                  |
LikelihoodD   : 0.259(03)    (03)         (02)                  |
PDERS         : 0.223(03)    (03)         (02)                  |
RuleFit       : 0.196(03)    (03)         (02)                  |
HMatrix       : 0.058(01)    (03)         (02)                  |
BDT           : 0.154(02)    (04)         (03)                  |
CutsGA        : 0.109(02)    (00)         (03)                  |
Likelihood    : 0.086(02)    (03)         (03)                  |

Testing efficiency compared to training efficiency (overtraining check):

MVA             Signal efficiency: from test sample (from training sample)
Methods:        @B=               @B=               @B=0.30
Fisher        : (0.275)           (0.658)           (0.873)
MLP           : (0.278)           (0.658)           (0.873)
LikelihoodD   : (0.273)           (0.657)           (0.872)
PDERS         : (0.389)           (0.691)           (0.881)
RuleFit       : (0.198)           (0.616)           (0.848)
HMatrix       : (0.060)           (0.623)           (0.868)
BDT           : (0.268)           (0.736)           (0.911)
CutsGA        : (0.123)           (0.424)           (0.715)
Likelihood    : (0.092)           (0.379)           (0.677)

(Slide annotations: arrows marking the "better classifier" and the "check for over-training".)
43
Subjective Summary of TMVA Classifier Properties
Table comparing the classifiers (Cuts, Likelihood, PDERS/k-NN, H-Matrix, Fisher, MLP, BDT, Rule fitting, SVM) against the criteria: performance with no/linear correlations, performance with nonlinear correlations, speed of training, speed of response, robustness against overtraining, robustness against weak input variables, the curse of dimensionality, and transparency (the ratings themselves are shown in the slide's table). The properties of the Function Discriminant (FDA) depend on the chosen function.
44
Regression example: the target, a quadratic function of 2 variables, approximated by a "linear regressor" and by a neural network.
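A minimal regression-training sketch along the lines of the classification example earlier (hedged: the exact TMVA 4 regression interface may differ; tree, variable and target names are placeholders):

```cpp
// TMVA regression sketch: learn a target as a function of the input variables.
#include "TFile.h"
#include "TTree.h"
#include "TMVA/Factory.h"
#include "TMVA/Types.h"

void train_regression(TTree* regTree) {
   TFile* output = TFile::Open("TMVAReg.root", "RECREATE");
   TMVA::Factory factory("TMVARegression", output, "!V:!Silent:AnalysisType=Regression");

   factory.AddVariable("var1", 'F');
   factory.AddVariable("var2", 'F');
   factory.AddTarget("fvalue");                        // the quantity to be estimated
   factory.AddRegressionTree(regTree, 1.0);            // tree containing variables and target
   factory.PrepareTrainingAndTestTree("", "SplitMode=Random");

   factory.BookMethod(TMVA::Types::kLD,  "LD",  "");   // "linear regressor"
   factory.BookMethod(TMVA::Types::kMLP, "MLP", "HiddenLayers=N+5");

   factory.TrainAllMethods();
   factory.TestAllMethods();
   factory.EvaluateAllMethods();
   output->Close();
}
```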
45
NEW! TMVA Users Guide for 4.0: web page with a quick-reference summary of all training options! TMVA Users Guide available: 97 pp., incl. code examples, arXiv physics/
46
Some Toy Examples
47
The "Schachbrett" toy (chess board). Performance achieved without parameter tuning: PDERS and BDT are the best "out of the box" classifiers; after some parameter tuning, SVM and ANN (MLP) perform equally well. (Plots: event distribution; events weighted by SVM response; theoretical maximum indicated.)
48
More Toys: Linear-, Cross-, Circular Correlations
Illustrate the behaviour of linear and nonlinear classifiers: linear correlations (same for signal and background), linear correlations (opposite for signal and background), circular correlations (same for signal and background).
49
How does linear decorrelation affect strongly nonlinear cases? (Plots: original correlations, SQRT decorrelation, PCA decorrelation.)
50
Weight Variables by Classifier Output
How well do the classifiers resolve the various correlation patterns? Linear correlations (same for signal and background), cross-linear correlations (opposite for signal and background), circular correlations (same for signal and background). (Panels: Fisher, MLP, BDT, PDERS, Likelihood, Likelihood-D.)
51
Final Classifier Performance
Background rejection versus signal efficiency curve: circular example, cross example, linear example.
52
Stability with Respect to Irrelevant Variables
Toy example with 2 discriminating and 4 non-discriminating variables.
53
Stability with Respect to Irrelevant Variables
Toy example with 2 discriminating and 4 non-discriminating variables: use only the two discriminating variables in the classifiers.
54
Stability with Respect to Irrelevant Variables
Toy example with 2 discriminating and 4 non-discriminating variables: use all variables in the classifiers versus only the two discriminating variables.
55
General remarks, and about systematics
56
General Advice for (MVA) Analyses
There is no magic in MVA methods: no need to be too afraid of "black boxes", but you typically still need careful tuning and some "hard work". The most important thing at the start is finding good observables: good separation power between S and B, little correlation amongst each other, and no correlation with the parameters you try to measure in your signal sample! Think also about possible combinations of variables: can this eliminate correlations? Always apply straightforward preselection cuts and let the MVA only do the rest. "Sharp features should be avoided": they cause numerical problems and loss of information when binning is applied; simple variable transformations (e.g. log(var1)) can often smooth out these regions and allow signal and background differences to appear more clearly. Treat regions of the detector that have different features "independently": combining them can introduce correlations where the variables would otherwise be uncorrelated!
57
Some Words about Systematic Errors
Typical worries are: What happens if the estimated "probability density" is wrong? Can the classifier, i.e. the discrimination function y(x), introduce systematic uncertainties? What happens if the training data do not match "reality"? Any wrong PDF leads to an imperfect discrimination function (imperfect; calling it "wrong" wouldn't be "right"): a loss of discrimination power, and that's all! Classical cuts face exactly the same problem, but in addition to cutting on features that are not correct, you can now also "exploit" correlations that are in fact not correct (HOWEVER: see next slide). Systematic errors are only introduced once "Monte Carlo events" with imperfect modeling are used for the efficiency, purity, or number of expected events; the same problem arises in a classical "cut" analysis. Use control samples to test the MVA-output distribution y(x). The combined variable (MVA output, y(x)) might "hide" problems in ONE individual variable: train the classifier with few variables only and compare with data, and check all input variables individually ANYWAY.
58
Systematic “Error” in Correlations
Use as the training sample events that have correlations; optimize CUTs and train a proper MVA (e.g. Likelihood, BDT). Then assume that in the "real data" there are NO correlations, and SEE what happens!
59
Systematic “Error” in Correlations
Compare "Data" (test sample) and Monte Carlo (both taken from the same underlying distribution).
60
Systematic “Error” in Correlations
Compare "Data" (test sample) and Monte Carlo, now taken from underlying distributions that differ by the correlations! Differences are ONLY visible in the MVA-output plots (and if you were to look at cut sequences…).
61
Treatment of Systematic Uncertainties
Is there nevertheless a strategy to become "less sensitive" to possible systematic uncertainties? Classically, for a variable that is prone to uncertainties, one does not cut in the region of steepest gradient; classically one would not place the most important cut on an uncertain variable. Try to make the classifier less sensitive to "uncertain variables", e.g. re-weight events in the training to decrease the separation in variables with large systematic uncertainty. A "calibration uncertainty" may shift the central value and hence worsen (or increase) the discrimination power of "var4". (Certainly not yet a recipe that can strictly be followed, more an idea of what could perhaps be done.)
62
Treatment of Systematic Uncertainties
1st way: classifier output distributions for signal only.
63
Treatment of Systematic Uncertainties
2nd way: classifier output distributions for signal only.
64
Copyrights & Credits. TMVA is open source software; use and redistribution of the source are permitted according to the terms of the BSD license. Acknowledgments: the fast development of TMVA would not have been possible without the contribution and feedback from many developers and users to whom we are indebted. Andreas Hoecker (CERN, Switzerland), Jörg Stelzer (CERN, Switzerland), Peter Speckmayer (CERN, Switzerland), Jan Therhaag (Universität Bonn, Germany), Eckhard von Toerne (Universität Bonn, Germany), Helge Voss (MPI für Kernphysik Heidelberg, Germany), Tancredi Carli (CERN, Switzerland), Asen Christov (Universität Freiburg, Germany), Or Cohen (CERN, Switzerland and Weizmann, Israel), Krzysztof Danielowski (IFJ and AGH/UJ, Krakow, Poland), Dominik Dannheim (CERN, Switzerland), Sophie Henrot-Versille (LAL Orsay, France), Matthew Jachowski (Stanford University, USA), Kamil Kraszewski (IFJ and AGH/UJ, Krakow, Poland), Attila Krasznahorkay Jr. (CERN, Switzerland, and Manchester U., UK), Maciej Kruk (IFJ and AGH/UJ, Krakow, Poland), Yair Mahalalel (Tel Aviv University, Israel), Rustem Ospanov (University of Texas, USA), Xavier Prudent (LAPP Annecy, France), Arnaud Robert (LPNHE Paris, France), Doug Schouten (S. Fraser U., Canada), Fredrik Tegenfeldt (Iowa University, USA, until Aug 2007), Alexander Voigt (CERN, Switzerland), Kai Voss (University of Victoria, Canada), Marcin Wolter (IFJ PAN Krakow, Poland), Andrzej Zemla (IFJ PAN Krakow, Poland). Backup slides on: (i) some illustrative toy examples, (ii) treatment of systematic uncertainties, (iii) sensitivity to weak input variables.
65
Questions/Discussion points
How to treat systematic uncertainties in MVA methods? Check multidimensional training samples including their correlations; which differences do we care about? Can one name a particular classifier that is always the best, or always the best for a specific class of models? How could one teach the classifiers to optimise for certain regions of interest (e.g. very large background rejection only)? Is that actually necessary, or does the cut on the MVA output provide just this? Train according to the "smallest error on the measurement" rather than the best separation of signal and background? Does anyone have a good idea for a better treatment of negative weights? How to best divide the data sample into "test" and "training" samples; do we want an additional validation sample? Automatic tuning of parameters; cross validation.
66
Bias Variance Trade off -- Overtraining
(Figure: prediction error for test and training samples versus model complexity, running from "high bias, low variance" at low complexity to "low bias, high variance" at high complexity; the growing gap between test and training error signals overtraining.)
67
Cross Validation. Many (all) classifiers have tuning parameters "a" that need to be controlled against overtraining: number of training cycles, number of nodes (neural net), smoothing parameter h (kernel density estimator), … The more free parameters a classifier has to adjust internally, the more prone it is to overtraining; more training data give better training results, but dividing the data set into "training" and "test" samples reduces the "training" data. Cross validation: divide the data sample into, say, 5 sub-sets; train 5 classifiers y_i(x,a), i = 1, …, 5, where classifier y_i(x,a) is trained without the i-th sub-sample; calculate the test error; choose the tuning parameter a for which CV(a) is minimal and train the final classifier using all data.
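The cross-validation test error referred to above can be written in the standard k-fold form (reconstructed here; the slide shows it as an image):

```latex
% k-fold cross-validation error: each event n is evaluated with the classifier y_{k(n)}
% trained without the subset containing event n; L is a loss function, e.g. squared error.
CV(a) = \frac{1}{N} \sum_{n=1}^{N} L\!\left( t_n,\; y_{k(n)}(x_n, a) \right)
```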
68
Minimisation. Robust global minimum finders are needed at various places in TMVA. Brute-force method: Monte Carlo sampling; sample the entire solution space and choose the solution providing the minimum estimator; a good global minimum finder, but poor accuracy. Default solution in HEP: (T)Minuit/Migrad [how much longer do we need to suffer…?]; gradient-driven search using a variable metric, can use a quadratic Newton-type solution; a poor global minimum finder that quickly gets stuck in the presence of local minima. Specific global optimisers implemented in TMVA: Genetic Algorithm (biology-inspired optimisation algorithm) and Simulated Annealing (slow "cooling" of the system to avoid "freezing" in a local solution). TMVA allows minimisers to be chained: for example, one can use MC sampling to detect the vicinity of the global minimum and then use Minuit to accurately converge to it.
69
Minimisation Techniques
Brute-force methods: random Monte Carlo or grid search; sample the entire solution space and choose the solution providing the minimum estimator; good global minimum finder, but poor accuracy. Default solution in HEP: Minuit; gradient-driven search using a variable metric, can use a quadratic Newton-type solution; poor global minimum finder, quickly gets stuck in the presence of local minima. Genetic Algorithm: biology-inspired; a "genetic" representation of points in the parameter space, using mutation and "crossover"; finds approximately global minima. Simulated Annealing: like heating up metal and slowly cooling it down ("annealing"); atoms in metal move towards the state of lowest energy, while for sudden cooling atoms tend to freeze in intermediate higher-energy states; slow "cooling" of the system avoids "freezing" in a local solution.
70
Binary Trees: a tree data structure in which each node has at most two children; typically the child nodes are called left and right. Binary trees are used in TMVA to implement binary search trees and decision trees. Amount of computing time to search for events: box search ∝ (Nevents)^Nvar, binary-tree search ∝ Nevents·Nvar·ln2(Nevents). (Diagram labels: root node, depth of a tree, leaf node.)
71
Gauss-Decorr-Gauss etc..
(Plots: GDG and GDGDGDG transformation chains, i.e. iterated Gauss and decorrelation steps.)