DESY Computing Seminar Hamburg

DESY Computing Seminar Hamburg 14.1.2008
Machine Learning with TMVA: A ROOT-based Tool for Multivariate Data Analysis
DESY Computing Seminar, Hamburg
The TMVA developer team: Andreas Höcker, Peter Speckmayer, Jörg Stelzer, Helge Voss

General Event Classification Problem
Event described by k variables (that are found to be discriminating) → (x_i) ∈ ℝ^k
Events can be classified into n categories: H1 … Hn
General classifier: f: ℝ^k → ℕ, (x_i) → {1,…,n}
- TMVA: only n=2, commonly the case in HEP (signal/background)
Most classification methods: f: ℝ^k → ℝ^d, (x_i) → (y_i)
- Further: ℝ^d → ℕ, (y_i) → {1,…,n}
- TMVA: d=1 → y ≥ ysep: signal, y < ysep: background
[Figure: example with k=2, n=3: categories H1, H2, H3 in the (x1, x2) plane]
DESY, Hamburg Multivariate Analysis with TMVA - Jörg Stelzer

Example: k=2 variables x1, x2 and n=3 categories H1, H2, H3
The problem: how to draw the boundaries between H1, H2, and H3 such that f(x) returns the true nature of x with maximum correctness
[Figure: three panels showing H1, H2, H3 in the (x1, x2) plane: Rectangular cuts? Linear boundaries? Non-linear boundaries?]
Simple example → I can do it by hand.

Machine Learning
Large input variable space, complex correlations: manual optimization very difficult
What is the optimal boundary f(x) to separate the categories?
- More pragmatic: which classifier is best at finding this optimal boundary (or estimates it most closely)?
Two general ways to build f(x):
- Supervised learning: in an event sample the category of each event is known. The machine adapts to give the smallest misclassification error on the training sample.
- Unsupervised learning: the correct category of each event is unknown. The machinery tries to discover structures in the dataset.
All classifiers in TMVA are supervised learning methods

Classification Problems in HEP
In HEP mostly two-class problems: signal (S) and background (B)
- Event level (Higgs searches, …)
- Cone level (Tau-vs-jet reconstruction, …)
- Track level (particle identification, …)
- Lifetime and flavour tagging (b-tagging, …)
- ...
Input information
- Kinematic variables (masses, momenta, decay angles, …)
- Event properties (jet/lepton multiplicity, sum of charges, …)
- Event shape (sphericity, Fox-Wolfram moments, …)
- Detector response (silicon hits, dE/dx, Cherenkov angle, shower profiles, muon hits, …)
- …

Classifiers in TMVA

Rectangular Cut Optimization
Intuitive and simple: rectangular volumes in variable space
Technical challenge: cut optimization
- MINUIT fit (simplex): was found not to be reliable
- Monte Carlo sampling: random scanning of parameter space; inefficient for a large number of input variables
- Genetic algorithm: preferred method
  - Samples of cut-sets (a population) are evaluated; the fittest individuals are cross-bred (including mutation) to create a new generation
  - The genetic algorithm can also be used as a standalone optimizer, outside the TMVA framework
- Simulated annealing: still need to optimize its performance
  - Simulates slow cooling of metal; introduces a temperature-dependent perturbation probability to recover from local minima
Cuts usually benefit from prior decorrelation of cut variables
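What any of these optimizers has to evaluate can be sketched in a few lines: a cut set is one window per variable, and its quality is judged from the fraction of signal and background events that pass. The names `CutSet`, `passesCuts` and `efficiency` are hypothetical and not part of TMVA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One rectangular cut per variable: lower[i] <= x[i] <= upper[i].
struct CutSet {
    std::vector<double> lower;
    std::vector<double> upper;
};

// An event passes if every variable lies inside its window.
bool passesCuts(const CutSet& cuts, const std::vector<double>& event) {
    assert(event.size() == cuts.lower.size() && event.size() == cuts.upper.size());
    for (std::size_t i = 0; i < event.size(); ++i) {
        if (event[i] < cuts.lower[i] || event[i] > cuts.upper[i]) return false;
    }
    return true;
}

// Fraction of events in a sample that pass the cut set.
double efficiency(const CutSet& cuts, const std::vector<std::vector<double>>& sample) {
    if (sample.empty()) return 0.0;
    std::size_t nPass = 0;
    for (const auto& event : sample) {
        if (passesCuts(cuts, event)) ++nPass;
    }
    return static_cast<double>(nPass) / static_cast<double>(sample.size());
}
```

An optimizer would call `efficiency` once on the signal sample and once on the background sample for every candidate cut set, and keep the cut sets with the best background rejection at a given signal efficiency.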

Projective Likelihood Estimator (PDE)
Probability density estimators for each input variable are combined in a likelihood estimator
Optimal MVA approach if variables are uncorrelated
- In practice rarely the case; solution: de-correlate the input or use a different method
Reference PDFs are automatically generated from training data: histograms (counting), splines (order 2, 3, 5), or unbinned kernel estimator
Output of the likelihood estimator is often strongly peaked at 0 and 1. To ease output parameterization TMVA applies an inverse Fermi transformation.
[Figure: reference PDFs]
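A minimal sketch of the projective likelihood ratio and of an inverse Fermi transformation follows. It assumes Gaussian stand-ins for the reference PDFs (TMVA builds them from histograms, splines or kernel estimates), and the scale parameter `tau` is a hypothetical name for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in reference PDF: a 1-D Gaussian (assumption for illustration only).
struct GaussPdf {
    double mean;
    double sigma;
    double operator()(double x) const {
        const double pi = 3.14159265358979323846;
        const double z = (x - mean) / sigma;
        return std::exp(-0.5 * z * z) / (sigma * std::sqrt(2.0 * pi));
    }
};

// Likelihood ratio y = L_S / (L_S + L_B), where L is the product of the
// per-variable PDFs (this is where correlations are ignored).
double likelihoodRatio(const std::vector<GaussPdf>& sigPdfs,
                       const std::vector<GaussPdf>& bkgPdfs,
                       const std::vector<double>& x) {
    assert(sigPdfs.size() == x.size() && bkgPdfs.size() == x.size());
    double lS = 1.0, lB = 1.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        lS *= sigPdfs[i](x[i]);
        lB *= bkgPdfs[i](x[i]);
    }
    return lS / (lS + lB);
}

// Inverse Fermi transformation: stretches the peaks at 0 and 1 apart.
// Only defined for 0 < y < 1; tau is a hypothetical scale parameter.
double inverseFermi(double y, double tau) {
    return -std::log(1.0 / y - 1.0) / tau;
}
```

The transformation maps y = 0.5 to 0 and sends values near 0 and 1 towards large negative and positive numbers, which makes the output easier to parameterize.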

Estimating PDF Kernels
Technical challenge: how to estimate the PDF shapes
3 ways:
- Parametric fitting (function): difficult to automate for arbitrary PDFs
- Nonparametric fitting: easy to automate, can create artefacts/suppress information
- Event counting: automatic, unbiased, but suboptimal
We have chosen to implement nonparametric fitting in TMVA
- Binned shape interpolation using spline functions (orders: 1, 2, 3, 5)
- Unbinned kernel density estimation (KDE) with Gaussian smearing
- TMVA performs automatic validation of goodness-of-fit
[Figure: fit examples; the original distribution is Gaussian]
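The unbinned kernel density estimate with Gaussian smearing can be sketched as follows. This is a fixed-bandwidth version under the assumption that the bandwidth `h` is supplied by the caller; it is not TMVA's implementation, and `gaussianKde` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Gaussian kernel density estimate at x: every training value x_i contributes
// a unit-area Gaussian of width h centred on x_i; the result is their average.
double gaussianKde(const std::vector<double>& sample, double x, double h) {
    assert(!sample.empty() && h > 0.0);
    const double norm = 1.0 / (h * std::sqrt(2.0 * 3.14159265358979323846));
    double sum = 0.0;
    for (double xi : sample) {
        const double z = (x - xi) / h;
        sum += norm * std::exp(-0.5 * z * z);
    }
    return sum / static_cast<double>(sample.size());
}
```

A small `h` follows statistical fluctuations of the training sample, a large `h` washes out real structure; this is the "artefacts/suppress information" trade-off mentioned above.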

Multidimensional PDE
Extension of the one-dimensional PDE approach to n dimensions
Counts signal and background reference events (training sample) in the vicinity V of the test event
Volume V definition:
- Size: fixed (defined by the data: % of Max-Min or RMS) or adaptive (defined by the number of events in the search volume)
- Shape: box or ellipsoid
Improve the yPDERS estimate within V by using various n-D kernel estimators (function of the (normalized) distance between test and reference events)
Practical challenges:
- Need very large training sample (curse of dimensionality of kernel based methods)
- No training, slow evaluation. Search speed improvement with kd-tree event sorting
Carli-Koblitz, NIM A501, 576 (2003)
[Figure: test event with its search volume among H0 and H1 reference events in the (x1, x2) plane]
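The counting step can be sketched with a fixed box volume and a plain linear scan (no kd-tree, no kernel weighting). The names, the normalisation by sample size and the return value of 0.5 for an empty volume are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Event = std::vector<double>;

// Count reference events inside a box of half-width w centred on the test event.
std::size_t countInBox(const std::vector<Event>& reference, const Event& test, double w) {
    std::size_t n = 0;
    for (const Event& ev : reference) {
        assert(ev.size() == test.size());
        bool inside = true;
        for (std::size_t i = 0; i < test.size(); ++i) {
            if (std::fabs(ev[i] - test[i]) > w) { inside = false; break; }
        }
        if (inside) ++n;
    }
    return n;
}

// Signal probability estimate from the local signal and background densities,
// each normalised to the size of its reference sample.
double pdersResponse(const std::vector<Event>& sig, const std::vector<Event>& bkg,
                     const Event& test, double w) {
    assert(!sig.empty() && !bkg.empty());
    const double dS = static_cast<double>(countInBox(sig, test, w)) / sig.size();
    const double dB = static_cast<double>(countInBox(bkg, test, w)) / bkg.size();
    if (dS + dB == 0.0) return 0.5;  // empty volume: undecided (assumption)
    return dS / (dS + dB);
}
```

The linear scan makes every evaluation cost O(N) in the number of reference events, which is the "slow evaluation" that kd-tree sorting addresses.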

Fisher’s Linear Discriminant Analysis
Well-known, simple and elegant MVA method
Fisher analysis determines an axis in the input variable hyperspace (F1,…,Fn), such that a projection of events onto this axis separates signal and background as much as possible
- Optimal for linearly correlated Gaussian variables with different S and B means
- Variable v with the same S and B sample mean → Fv=0
- Projection: the Fisher coefficients and the classifier are defined via W, the sum of the S and B covariance matrices
Function discriminant analysis (FDA) [new]
- Fit any user-defined function of input variables requiring that signal events return 1 and background 0
- Parameter fitting: genetic algorithm, MINUIT, MC and combinations
- Easy reproduction of Fisher result, but can add nonlinearities
- Very transparent discriminator
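For reference, the Fisher coefficients and the resulting classifier are commonly written as below. This is the standard textbook form, given here as a sketch and not as a transcription of the slide's formulas; the normalisation constant and the offset $F_0$ are left open. The symbols $\bar{x}_{S,l}$ and $\bar{x}_{B,l}$ denote the signal and background sample means of variable $l$.

```latex
F_k \;\propto\; \sum_{l=1}^{n} \left(W^{-1}\right)_{kl}
      \left(\bar{x}_{S,l} - \bar{x}_{B,l}\right),
\qquad
y_{\mathrm{Fi}}(x) \;=\; F_0 + \sum_{k=1}^{n} F_k\, x_k
```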

Artificial Neural Network (ANN)
Multilayer perceptron: fully connected, feed forward, k hidden layers
ANNs are non-linear discriminants
- Non-linearity comes from the activation function (Fisher is an ANN with a linear activation function)
Training: back-propagation method
- Randomly feed signal and background events to the MLP and compare the desired output {0,1} with the received output (0,1): ε = d - r
- Correct the weights, depending on ε and the learning rate η
Weierstrass theorem: an MLP can approximate every continuous function to arbitrary precision with just one layer and an infinite number of nodes
[Figure: network with 1 input layer (Nvar discriminating input variables), k hidden layers (M1 … Mk nodes) and 1 output layer (1 output variable); inset shows a typical activation function A]
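The weight correction can be illustrated on the smallest possible network: a single sigmoid neuron trained by gradient descent on the squared error. This is the one-node special case of back-propagation, written as a sketch; `Neuron`, `response` and `trainStep` are hypothetical names, not TMVA classes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sigmoid activation function A(v).
double sigmoid(double v) { return 1.0 / (1.0 + std::exp(-v)); }

struct Neuron {
    std::vector<double> weights;
    double bias;
};

// Received output r = A(bias + sum_i w_i * x_i), always inside (0,1).
double response(const Neuron& n, const std::vector<double>& x) {
    assert(x.size() == n.weights.size());
    double v = n.bias;
    for (std::size_t i = 0; i < x.size(); ++i) v += n.weights[i] * x[i];
    return sigmoid(v);
}

// One training step: compare desired output d with received output r,
// then move the weights along the gradient, scaled by the learning rate eta.
// Returns the error eps = d - r measured before the update.
double trainStep(Neuron& n, const std::vector<double>& x, double desired, double eta) {
    const double r = response(n, x);
    const double eps = desired - r;
    const double grad = eps * r * (1.0 - r);  // uses A'(v) = r * (1 - r) for the sigmoid
    for (std::size_t i = 0; i < x.size(); ++i) n.weights[i] += eta * grad * x[i];
    n.bias += eta * grad;
    return eps;
}
```

In a full MLP the same error term is propagated backwards through the hidden layers, so that every weight receives a correction of this form.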

Boosted Decision Trees (BDT)
A DT is a series of cuts that split the sample set into ever smaller sets; leaves are assigned either S or B status
Classifies events by following a sequence of cuts, depending on the event's variable content, until an S or B leaf is reached
Growing
- Each split tries to maximize the gain in separation (Gini index)
DT is dimensionally robust and easy to understand, but not powerful
1. Pruning
- Bottom-up pruning of a decision tree
- Protect from overtraining by removing statistically insignificant nodes
2. Boosting (AdaBoost)
- Increase the weight of incorrectly identified events → build new DT
- Final classifier: 'forest' of DTs, linearly combined
  - Large coefficient for DT with small misclassification
- Improved performance and stability
BDT requires only little tuning to achieve good performance
[Figure: parent node (S,B) split into daughter nodes (S1,B1) and (S2,B2)]
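The two quantities that drive growing and boosting can be sketched as follows. The Gini index is taken as p(1-p) with p the signal purity of a node (definitions differ by a constant factor), and the boost weight (1-err)/err is the usual AdaBoost choice; both are assumptions about the exact form, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Gini index of a node with nSig signal and nBkg background events:
// zero for a pure node, maximal (0.25) for a 50/50 mixture.
double gini(double nSig, double nBkg) {
    const double n = nSig + nBkg;
    if (n <= 0.0) return 0.0;
    const double p = nSig / n;
    return p * (1.0 - p);
}

// Separation gain of a split: event-weighted Gini of the parent node
// minus the event-weighted Gini of the two daughter nodes.
double separationGain(double sL, double bL, double sR, double bR) {
    const double nL = sL + bL;
    const double nR = sR + bR;
    return (nL + nR) * gini(sL + sR, bL + bR) - nL * gini(sL, bL) - nR * gini(sR, bR);
}

// AdaBoost: weights of misclassified events are multiplied by this factor
// before the next tree is grown; err is the tree's misclassification rate.
double boostWeight(double err) {
    assert(err > 0.0 && err < 0.5);
    return (1.0 - err) / err;
}
```

A split that separates the sample perfectly has the largest gain, a split that leaves the purity unchanged has zero gain, and a tree with a small misclassification rate receives a large boost weight (its logarithm is typically used as the tree's coefficient in the forest).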

Predictive Learning via Rule Ensembles (RuleFit)
Following the RuleFit approach by Friedman-Popescu (Friedman-Popescu, Tech Rep, Stat. Dpt, Stanford U., 2003)
Model is a linear combination of rules, where a rule is a sequence of cuts defining a region in the input parameter space
RuleFit classifier: y_RF(x) = a_0 + Σ_m a_m r_m(x) + Σ_k b_k x_k
- Sum of rules: r_m = 1 if all cuts of the sequence are satisfied, = 0 otherwise
- Linear Fisher term: x_k are the normalised discriminating event variables
The problem to solve is:
- Create rule ensemble: use forest of decision trees either from a BDT, or from a random forest generator (TMVA)
- Fit coefficients a_m, b_k, minimizing risk of misclassification (Friedman et al.)
Pruning removes "topologically equal" rules (same variables in cut sequence)
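Evaluating such a model is straightforward once the rules and coefficients are known. The sketch below assumes the model is an offset plus a sum of weighted rules plus a linear term in the input variables; the offset `a0` and all type and function names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One cut of a rule: variable index and allowed window [min, max].
struct Cut {
    std::size_t var;
    double min;
    double max;
};
using Rule = std::vector<Cut>;

// r_m(x) = 1 if all cuts of the rule are satisfied, 0 otherwise.
double evalRule(const Rule& rule, const std::vector<double>& x) {
    for (const Cut& c : rule) {
        assert(c.var < x.size());
        if (x[c.var] < c.min || x[c.var] > c.max) return 0.0;
    }
    return 1.0;
}

// y(x) = a0 + sum_m a[m] * r_m(x) + sum_k b[k] * x[k]
double ruleFitResponse(double a0, const std::vector<Rule>& rules,
                       const std::vector<double>& a, const std::vector<double>& b,
                       const std::vector<double>& x) {
    assert(rules.size() == a.size() && b.size() == x.size());
    double y = a0;
    for (std::size_t m = 0; m < rules.size(); ++m) y += a[m] * evalRule(rules[m], x);
    for (std::size_t k = 0; k < x.size(); ++k) y += b[k] * x[k];
    return y;
}
```

The fitting step, which determines the coefficients, is the expensive part and is not shown here.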

Support Vector Machine
Find the hyperplane that best separates linearly separable signal and background (1962)
- Best separation: maximum distance (margin) between the closest events (support vectors) and the hyperplane
- Wrongly classified events add an extra term to the cost function, which is minimized
Non-linear cases:
- Transform variables into a higher dimensional space where again a linear boundary (hyperplane) can separate the data (only since the mid-'90s)
- Explicit form of the transformation is not required; the cost function depends only on scalar products between events: use kernel functions to approximate scalar products between transformed vectors in the higher dimensional space
- Choose a kernel and fit the hyperplane using the linear techniques developed above
- Available kernels: Gaussian, polynomial, sigmoid
[Figure: separable and non-separable data in the (x1, x2) plane with optimal hyperplane, margin and support vectors; mapping into a higher dimensional space (x1, x2, x3)]
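The three kernel types can be written down in their common textbook forms. The parameter names (`sigma`, `c`, `degree`, `kappa`, `theta`) are assumptions and do not necessarily match TMVA's option names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Scalar product of two events.
double dot(const Vec& x, const Vec& y) {
    assert(x.size() == y.size());
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
    return s;
}

// Gaussian kernel: exp(-|x - y|^2 / (2 sigma^2)).
double gaussianKernel(const Vec& x, const Vec& y, double sigma) {
    assert(x.size() == y.size());
    double d2 = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) d2 += (x[i] - y[i]) * (x[i] - y[i]);
    return std::exp(-d2 / (2.0 * sigma * sigma));
}

// Polynomial kernel: (x.y + c)^degree.
double polynomialKernel(const Vec& x, const Vec& y, double c, int degree) {
    return std::pow(dot(x, y) + c, degree);
}

// Sigmoid kernel: tanh(kappa * x.y + theta).
double sigmoidKernel(const Vec& x, const Vec& y, double kappa, double theta) {
    return std::tanh(kappa * dot(x, y) + theta);
}
```

Each function replaces the plain scalar product in the linear cost function, so the hyperplane fit itself does not change.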

Data Preprocessing: Decorrelation
Various classifiers perform sub-optimally in the presence of correlations between input variables (Cuts, Projective LH), others are slower (BDT, RuleFit)
Removal of linear correlations by rotating input variables
- Determine the square-root C′ of the covariance matrix C, i.e., C = C′C′
- Transform original (x_i) into decorrelated variable space (x′_i) by: x′ = C′⁻¹x
- Also implemented: Principal Component Analysis (PCA)
Note that decorrelation is only complete if
- Correlations are linear
- Input variables are Gaussian distributed
- Not a very accurate conjecture in general
[Figure: scatter plots: original, SQRT decorr., PCA decorr.]
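For two variables the square-root decorrelation can be worked out in closed form, which makes a compact illustration. The sketch uses the 2x2 identity sqrt(C) = (C + sqrt(det C) I) / sqrt(tr C + 2 sqrt(det C)), valid for symmetric positive-definite C; a general implementation would diagonalise C instead. `Mat2` and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// 2x2 matrix [[a, b], [c, d]].
struct Mat2 {
    double a, b, c, d;
};

Mat2 multiply(const Mat2& x, const Mat2& y) {
    return {x.a * y.a + x.b * y.c, x.a * y.b + x.b * y.d,
            x.c * y.a + x.d * y.c, x.c * y.b + x.d * y.d};
}

Mat2 inverse(const Mat2& m) {
    const double det = m.a * m.d - m.b * m.c;
    assert(det != 0.0);
    return {m.d / det, -m.b / det, -m.c / det, m.a / det};
}

// Square root of a symmetric positive-definite 2x2 matrix (closed form).
Mat2 sqrtSym2(const Mat2& m) {
    const double s = std::sqrt(m.a * m.d - m.b * m.c);  // sqrt of determinant
    const double t = std::sqrt(m.a + m.d + 2.0 * s);    // normalisation
    return {(m.a + s) / t, m.b / t, m.c / t, (m.d + s) / t};
}

// Decorrelate one event (x0, x1): x' = inverse(sqrt(C)) * x.
void decorrelate(const Mat2& sqrtCInv, double& x0, double& x1) {
    const double y0 = sqrtCInv.a * x0 + sqrtCInv.b * x1;
    const double y1 = sqrtCInv.c * x0 + sqrtCInv.d * x1;
    x0 = y0;
    x1 = y1;
}
```

After the transformation the covariance matrix of the new variables is inverse(C′) C inverse(C′), which is the unit matrix: the linear correlations are gone and both variables have unit variance.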

Is There a Best Classifier?
Performance
- In the presence/absence of linear/nonlinear correlations
Speed
- Training / evaluation time
Robustness, stability
- Sensitivity to overtraining, weak input variables
- Size of training sample
Dimensional scalability
- Do performance, speed, and robustness deteriorate with large dimensions?
Clarity
- Can the learning procedure/result be easily understood/visualized?

No Single Best
[Table: classifiers (columns) rated against criteria (rows); the rating symbols are not preserved]
Classifiers: Cuts, Likelihood, PDERS/k-NN, H-Matrix, Fisher, MLP, BDT, RuleFit, SVM
Criteria:
- Performance: no / linear correlations; nonlinear correlations
- Speed: training; response
- Robustness: overtraining; weak input variables
- Curse of dimensionality
- Clarity

What is TMVA
Motivation:
- Classifiers perform very differently depending on the data; all should be tested on a given problem
- Situation for many years: usually only a small number of classifiers were investigated by analysts
- Needed: a tool that enables the analyst to simultaneously evaluate the performance of a large number of classifiers on his/her dataset
Design criteria: performance and convenience (a good tool does not have to be difficult to use)
- Training, testing, and evaluation of many classifiers in parallel
- Preprocessing of input data: decorrelation (PCA, Gaussianization)
- Illustrative tools to compare performance of all classifiers (ranking of classifiers, ranking of input variables, choice of working point)
- Actively protect against overtraining
- Straightforward application to test data
Special needs of high energy physics addressed
- Two classes, event weights, familiar terminology

Using TMVA
A typical TMVA analysis consists of two main steps:
1. Training phase: training, testing and evaluation of classifiers using data samples with known signal and background composition
2. Application phase: using selected trained classifiers to classify unknown data samples

Technical Aspects
TMVA is open source, written in C++, and based on and part of ROOT
- Development hosted on SourceForge, where all information is available
- Bundled with ROOT since
- Training requires the ROOT environment; resulting classifiers are also available as standalone C++ code
- Six core developers, many contributors
- > 1400 downloads since Mar 2006 (not counting ROOT users)
- Mailing list for reporting problems
- Users Guide (97 pages) with classifier descriptions and code examples: arXiv physics/

Training with TMVA
User usually starts with the template TMVAnalysis.C
- Choose training variables
- Choose input data
- Select classifiers (adjust training options; these are described in the manual and can be printed by specifying option 'H')
Template TMVAnalysis.C (also .py) available at $TMVA/macros/ and $ROOTSYS/tmva/test/
[Figure: TMVA GUI]

Evaluation Output
Evaluation results ranked by best signal efficiency and purity (area)
    MVA           Signal efficiency at bkg eff. (error):   |  Sepa-    Signifi-
    Methods:      @B=        @B=     @B=     Area          |  ration:  cance:
    Fisher      : 0.268(03)  (03)    (02)                  |
    MLP         : 0.266(03)  (03)    (02)                  |
    LikelihoodD : 0.259(03)  (03)    (02)                  |
    PDERS       : 0.223(03)  (03)    (02)                  |
    RuleFit     : 0.196(03)  (03)    (02)                  |
    HMatrix     : 0.058(01)  (03)    (02)                  |
    BDT         : 0.154(02)  (04)    (03)                  |
    CutsGA      : 0.109(02)  (00)    (03)                  |
    Likelihood  : 0.086(02)  (03)    (03)                  |
[Annotation: better classifiers appear higher in the ranking]
Testing efficiency compared to training efficiency (overtraining check)
    MVA           Signal efficiency: from test sample (from training sample)
    Methods:      @B=       @B=       @B=0.30
    Fisher      : (0.275)   (0.658)   (0.873)
    MLP         : (0.278)   (0.658)   (0.873)
    LikelihoodD : (0.273)   (0.657)   (0.872)
    PDERS       : (0.389)   (0.691)   (0.881)
    RuleFit     : (0.198)   (0.616)   (0.848)
    HMatrix     : (0.060)   (0.623)   (0.868)
    BDT         : (0.268)   (0.736)   (0.911)
    CutsGA      : (0.123)   (0.424)   (0.715)
    Likelihood  : (0.092)   (0.379)   (0.677)
Remark on overtraining
- Occurs when classifier training becomes sensitive to the events of the particular training sample, rather than just to the generic features
- Sensitivity to overtraining depends on classifier: e.g., Fisher insensitive, BDT very sensitive
- Detect overtraining: compare performance between training and test sample
- Counteract overtraining: e.g., smooth likelihood PDFs, prune decision trees, …

More Evaluation Output
Input Variable Ranking
    --- Fisher : Ranking result (top variable is best ranked)
    --- Fisher :
    --- Fisher : Rank : Variable : Discr. power
    --- Fisher :    1 : var      : 2.175e-01
    --- Fisher :    2 : var      : 1.718e-01
    --- Fisher :    3 : var      : 9.549e-02
    --- Fisher :    4 : var      : 2.841e-02
How useful is a variable? [Annotation: better variables appear higher in the ranking]
Classifier correlation and overlap
    --- Factory : Inter-MVA overlap matrix (signal):
    --- Factory :
    --- Factory :             Likelihood  Fisher
    --- Factory : Likelihood:
    --- Factory : Fisher:
Do classifiers perform the same separation into signal and background?
If two classifiers have similar performance, but significant non-overlapping classifications → check if you can combine them!

Graphical Evaluation
Classifier output distributions for independent test sample:

Graphical Evaluation
There is no unique way to express the performance of a classifier → several benchmark quantities computed by TMVA
- Signal eff. at various background effs. (= 1 – rejection) when cutting on classifier output
- The separation
- "Rarity" implemented (background flat)
  - Comparison of signal shapes between different classifiers
  - Quick check: background on data should be flat
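As an illustration of one such benchmark, the sketch below computes a separation measure from binned classifier output distributions. It assumes the definition ⟨S²⟩ = ½ ∫ (ŷ_S(y) − ŷ_B(y))² / (ŷ_S(y) + ŷ_B(y)) dy with unit-normalised densities ŷ_S and ŷ_B, which is the form given in the TMVA Users Guide as far as we recall; the function name and the binned approximation are our own.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Binned separation: sigDensity and bkgDensity hold the unit-normalised
// densities of the classifier output in bins of equal width binWidth.
// Returns 0 for identical shapes and 1 for shapes without any overlap.
double separation(const std::vector<double>& sigDensity,
                  const std::vector<double>& bkgDensity, double binWidth) {
    assert(sigDensity.size() == bkgDensity.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < sigDensity.size(); ++i) {
        const double total = sigDensity[i] + bkgDensity[i];
        if (total > 0.0) {  // empty bins do not contribute
            const double diff = sigDensity[i] - bkgDensity[i];
            sum += diff * diff / total;
        }
    }
    return 0.5 * sum * binWidth;
}
```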

Visualization Using the GUI
Projective likelihood PDFs, MLP training, BDTs, …
average no. of nodes before/after pruning: 4193 / 968

Choosing a Working Point
Depending on the problem the user might want to
- Achieve a certain signal purity, signal efficiency, or background reduction, or
- Find the selection that results in the highest signal significance (depending on the expected signal and background statistics)
Using the TMVA graphical output one can determine at which classifier output value one needs to cut to separate signal from background
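Finding the cut with the highest significance amounts to a simple scan. The sketch below assumes S/√(S+B) as the significance measure and takes the expected signal and background yields as inputs; `WorkingPoint`, `fractionAbove` and `bestCut` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct WorkingPoint {
    double cut;
    double significance;
};

// Fraction of classifier outputs at or above the cut value.
double fractionAbove(const std::vector<double>& outputs, double cut) {
    if (outputs.empty()) return 0.0;
    std::size_t n = 0;
    for (double y : outputs) {
        if (y >= cut) ++n;
    }
    return static_cast<double>(n) / static_cast<double>(outputs.size());
}

// Scan every observed output value as a candidate cut and keep the one
// that maximises S / sqrt(S + B) for the expected yields.
WorkingPoint bestCut(const std::vector<double>& sigOut, const std::vector<double>& bkgOut,
                     double nSigExpected, double nBkgExpected) {
    std::vector<double> candidates(sigOut);
    candidates.insert(candidates.end(), bkgOut.begin(), bkgOut.end());
    WorkingPoint best{0.0, 0.0};
    for (double cut : candidates) {
        const double s = fractionAbove(sigOut, cut) * nSigExpected;
        const double b = fractionAbove(bkgOut, cut) * nBkgExpected;
        if (s + b <= 0.0) continue;
        const double z = s / std::sqrt(s + b);
        if (z > best.significance) best = {cut, z};
    }
    return best;
}
```

The optimal cut depends on the expected yields: with a larger expected background the scan moves towards tighter cuts.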

Applying the Trained Classifier
Use the TMVA::Reader class, example in TMVApplication.C:
- Set input variables
- Book classifier with the weight file (contains all information)
- Compute classifier response inside event loop → use it
Also standalone C++ class without ROOT dependence (from ClassApplication.C):
    std::vector<std::string> inputVars;
    …
    classifier = new ReadMLP( inputVars );
    for (int i=0; i<nEv; i++) {
       std::vector<double> inputVec = …;
       // inputVec is a vector object, not a pointer, so it is passed directly
       double retval = classifier->GetMvaValue( inputVec );
    }
Templates TMVApplication.C and ClassApplication.C available at $TMVA/macros/ and $ROOTSYS/tmva/test/

Extending TMVA
A user might have his own implementation of a multivariate classifier, or want to use an external one
With ROOT (16.Jan.08) the user can seamlessly evaluate and compare his own classifier within TMVA:
- Requirement: the user's class must be derived from TMVA::MethodBase and must implement the TMVA::IMethod interface
- The class must be added to the factory via ROOT's plugin mechanism
- Training, testing, evaluation, and comparison can then be done as usual
- Example in TMVAnalysis.C

A Word on Treatment of Systematics?
There is no difference in principle in systematics evaluation between single discriminating variables and MV classifiers
- Control sample to estimate uncertainty on classifier output (not necessarily for each input variable)
- Advantage: correlations automatically taken into account
Some things could be done. Example: var4 may in reality have a shifted central value and hence a worse discrimination power
One can ignore the systematic in the training:
- var4 appears stronger in training than it might be → suboptimal performance (bad training, not wrong)
- Classifier response will strongly depend on "var4", and hence will have a larger systematic uncertainty
Better:
- Train with shifted (weakened) var4
- Then evaluate systematic error on classifier output

Concluding Remarks
Multivariate classifiers are no black boxes; we just need to understand them
- Cuts and Likelihood are transparent → if they perform, use them
- In presence of correlations other classifiers are better
  - Difficult to understand at any rate
- Enormous acceptance growth in recent decade in HEP
TMVA provides means to train, evaluate, compare, and apply different classifiers
TMVA also tries, through visualization, to improve the understanding of the internals of each classifier
Acknowledgments: The fast development of TMVA would not have been possible without the contribution and feedback from many developers and users to whom we are indebted. We thank in particular the CERN Summer students Matt Jachowski (Stanford) for the implementation of TMVA's new MLP neural network, Yair Mahalalel (Tel Aviv) for a significant improvement of PDERS, and Or Cohen for the development of the general classifier boosting, the Krakow student Andrzej Zemla and his supervisor Marcin Wolter for programming a powerful Support Vector Machine, as well as Rustem Ospanov for the development of a fast k-NN algorithm. We are grateful to Doug Applegate, Kregg Arms, René Brun and the ROOT team, Tancredi Carli, Zhiyi Liu, Elzbieta Richter-Was, Vincent Tisserand and Alexei Volk for helpful conversations.

Outlook
Primary development from this Summer: generalized classifiers
1. Be able to boost or bag any classifier
2. Combine any classifier with any other classifier using any combination of input variables in any phase space region
Item 1 is ready and now in testing mode. To be deployed after the upcoming ROOT release.

A Few Toy Examples

Checker Board Example
Performance achieved without parameter tuning: PDERS and BDT are the best "out of the box" classifiers
After specific tuning, SVM and MLP also perform well
[Figure: performance curves compared with the theoretical maximum]

Linear-, Cross-, Circular Correlations
Illustrate the behavior of linear and nonlinear classifiers
- Linear correlations (same for signal and background)
- Linear correlations (opposite for signal and background)
- Circular correlations (same for signal and background)

Linear-, Cross-, Circular Correlations
Plot test events weighted by classifier output (red: signal-like, blue: background-like)
- Linear correlations (same for signal and background)
- Cross-linear correlations (opposite for signal and background)
- Circular correlations (same for signal and background)
[Figure: panels for Fisher, MLP, BDT, PDERS, Likelihood-D, Likelihood]

Final Performance
Background rejection versus signal efficiency curve:
[Figure: three panels: Linear Example, Cross Example, Circular Example]

Additional Information

Stability with Respect to Irrelevant Variables
Toy example with 2 discriminating and 4 non-discriminating variables:
[Figure: two panels: use all discriminant variables in classifiers; use only two discriminant variables in classifiers]

TMVAnalysis.C Script for Training
    #include "TFile.h"
    #include "TTree.h"
    #include "TMVA/Factory.h"
    #include "TMVA/Types.h"

    void TMVAnalysis( )
    {
       // create Factory
       TFile* outputFile = TFile::Open( "TMVA.root", "RECREATE" );
       TMVA::Factory *factory = new TMVA::Factory( "MVAnalysis", outputFile, "!V" );

       // give training/test trees
       TFile *input = TFile::Open( "tmva_example.root" );
       factory->AddSignalTree    ( (TTree*)input->Get("TreeS"), 1.0 );
       factory->AddBackgroundTree( (TTree*)input->Get("TreeB"), 1.0 );

       // register input variables
       factory->AddVariable( "var1+var2", 'F' );
       factory->AddVariable( "var1-var2", 'F' );
       factory->AddVariable( "var3", 'F' );
       factory->AddVariable( "var4", 'F' );
       factory->PrepareTrainingAndTestTree( "",
          "NSigTrain=3000:NBkgTrain=3000:SplitMode=Random:!V" );

       // select MVA methods
       factory->BookMethod( TMVA::Types::kLikelihood, "Likelihood",
          "!V:!TransformOutput:Spline=2:NSmooth=5:NAvEvtPerBin=50" );
       factory->BookMethod( TMVA::Types::kMLP, "MLP",
          "!V:NCycles=200:HiddenLayers=N+1,N:TestRate=5" );

       // train, test and evaluate
       factory->TrainAllMethods();
       factory->TestAllMethods();
       factory->EvaluateAllMethods();

       outputFile->Close();
       delete factory;
    }

TMVApplication.C Script for Application
    #include "TFile.h"
    #include "TTree.h"
    #include "TMVA/Reader.h"

    void TMVApplication( )
    {
       // create Reader
       TMVA::Reader *reader = new TMVA::Reader( "!Color" );

       // register the variables (same expressions and order as in the training)
       Float_t var1, var2, var3, var4;
       reader->AddVariable( "var1+var2", &var1 );
       reader->AddVariable( "var1-var2", &var2 );
       reader->AddVariable( "var3", &var3 );
       reader->AddVariable( "var4", &var4 );

       // book classifier(s)
       reader->BookMVA( "MLP classifier", "weights/MVAnalysis_MLP.weights.txt" );

       // prepare event loop
       TFile *input = TFile::Open( "tmva_example.root" );
       TTree* theTree = (TTree*)input->Get("TreeS");
       // … set branch addresses for user TTree

       // start at event 3000: the training script requested 3000 signal events for training
       for (Long64_t ievt=3000; ievt<theTree->GetEntries(); ievt++) {
          theTree->GetEntry(ievt);

          // compute input variables
          var1 = userVar1 + userVar2;
          var2 = userVar1 - userVar2;
          var3 = userVar3;
          var4 = userVar4;

          // calculate classifier output
          Double_t out = reader->EvaluateMVA( "MLP classifier" );
          // do something with it …
       }
       delete reader;
    }
