Statistical Methods for Data Analysis
Multivariate discriminators with TMVA
Luca Lista, INFN Napoli

Purpose of TMVA
Provide support, with a uniform interface, for many multivariate analysis techniques:
- Rectangular cut optimization (binary splits)
- Projective likelihood estimation
- Multi-dimensional likelihood estimation (PDE range-search, k-NN)
- Linear and nonlinear discriminant analysis (H-Matrix, Fisher, FDA)
- Artificial neural networks (three different implementations)
- Support vector machines
- Boosted/bagged decision trees
- Predictive learning via rule ensembles (RuleFit)
The package is integrated with the ROOT distribution, and helper tools for visualization are provided.

Variable preprocessing
For each classifier, an optional (but enabled by default) preprocessing of the variable set can be applied:
- Variables can be normalized to a common range
- Linear transformation into an uncorrelated variable set
- Projection onto principal components (the axes of maximum variance)
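As a sketch (anticipating the classifier booking described later), the preprocessing is requested per classifier through its option string. In the TMVA version used in these slides this is done with the Preprocess token, while newer releases use VarTransform; both spellings below are assumptions about the installed version:

    // Older TMVA option syntax (as used in the booking example later in these slides):
    factory->BookMethod(TMVA::Types::kFisher, "FisherD", "Preprocess=Decorrelate");
    // Newer TMVA releases use the VarTransform token instead, e.g.:
    // factory->BookMethod(TMVA::Types::kFisher, "FisherD", "VarTransform=Decorrelate");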

TMVA Factory
All the main TMVA objects are managed via a factory object:

    TFile out("tmvaOut.root", "RECREATE");
    TMVA::Factory * factory =
       new TMVA::Factory("<JobName>", &out, "<options>");

- out is a writable ROOT file that TMVA fills with histograms and trees
- JobName is the conventional name of the job
- The option string controls, e.g., verbosity ("V=False") and colored text output ("Color=True")
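For instance, a minimal concrete instantiation could look like this (the job name and option tokens are only examples; the available options depend on the TMVA version):

    TFile out("tmvaOut.root", "RECREATE");
    TMVA::Factory * factory =
       new TMVA::Factory("MyTMVAJob", &out, "!V:Color");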

Specify training and test samples
Input can be specified as ROOT trees or ASCII files. If signal and background are stored in different trees:

    TTree * sigTree  = (TTree*) sigSrc->Get("<SigTreeName>");
    TTree * bkgTreeA = (TTree*) bkgSrc->Get("<BkgTreeNameA>");
    TTree * bkgTreeB = (TTree*) bkgSrc->Get("<BkgTreeNameB>");
    TTree * bkgTreeC = (TTree*) bkgSrc->Get("<BkgTreeNameC>");
    Double_t sigWeight = 1.0;
    Double_t bkgWeightA = 1.0, bkgWeightB = 1.0, bkgWeightC = 1.0;
    factory->AddSignalTree(sigTree, sigWeight);
    factory->AddBackgroundTree(bkgTreeA, bkgWeightA);
    factory->AddBackgroundTree(bkgTreeB, bkgWeightB);
    factory->AddBackgroundTree(bkgTreeC, bkgWeightC);
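In the snippet above, sigSrc and bkgSrc are the opened input ROOT files; a minimal sketch of how they could be obtained (the file names are placeholders):

    TFile * sigSrc = TFile::Open("signal.root");
    TFile * bkgSrc = TFile::Open("background.root");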

Alternative input specification
Cuts can be specified to select signal and background events from a single tree; TCut objects (string cuts, e.g. "signal==1") are supported, for instance based on flags stored in the tree:

    TTree * inputTree = (TTree*) src->Get("<TreeName>");
    TCut sigCut = ...;
    TCut bkgCut = ...;
    factory->SetInputTrees(inputTree, sigCut, bkgCut);

Input can also be given as ASCII files:

    // The first line of each file must contain the variable specification
    // in ROOT format, e.g.: x/F:y/F:z/F:k/I
    // The following lines contain the ordered variable values
    TString sigFile("signal.txt");
    TString bkgFile("background.txt");
    Double_t sigWeight = 1.0, bkgWeight = 1.0;
    factory->SetInputTrees(sigFile, bkgFile, sigWeight, bkgWeight);
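Following the "signal==1" example above, the flag-based selection could be written concretely as (tree and branch names are placeholders):

    TTree * inputTree = (TTree*) src->Get("Events");
    TCut sigCut = "signal == 1";   // events flagged as signal
    TCut bkgCut = "signal == 0";   // everything else treated as background
    factory->SetInputTrees(inputTree, sigCut, bkgCut);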

Selecting variables for the MVA
Single variables or combinations of variables (via ROOT TFormula expressions) are supported:

    factory->AddVariable("x",     'F');
    factory->AddVariable("y",     'F');
    factory->AddVariable("x+y+z", 'F');
    factory->AddVariable("k",     'I');

The variable type is given by an optional character code: 'F' for float or double, 'I' for int, short or char (signed or unsigned).
Per-event weights can be computed from variables in the tree:

    factory->SetWeightExpression("<weightExpression>");

Normalization of each variable to the range [0, 1] can be requested with the Boolean option Normalise.
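Putting the pieces together, a typical variable declaration block might read (the branch names here are purely illustrative):

    factory->AddVariable("pt",       'F');   // single branch
    factory->AddVariable("abs(eta)", 'F');   // TFormula expression
    factory->AddVariable("nTracks",  'I');   // integer-type branch
    factory->SetWeightExpression("evtWeight");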

Prepare training data
The data are internally copied and split into a training tree and a test tree. The user can specify the size of both the training and the test samples:

    TCut presel = ...;
    factory->PrepareTrainingAndTestTrees(presel, "<options>");

Options:
- Sample sizes can be specified via NSigTrain=5000:NBkgTrain=5000:NSigTest=5000:NBkgTest=5000; the default (0) means that all (remaining) events are taken
- SplitMode specifies how the training and test events are extracted (Block, Alternate or Random, with the random seed set via SplitSeed=123456)
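A concrete call combining the options above could be (the sample sizes and the preselection cut are arbitrary examples):

    TCut presel = "pt > 10";   // hypothetical preselection cut
    factory->PrepareTrainingAndTestTrees(presel,
       "NSigTrain=5000:NBkgTrain=5000:NSigTest=5000:NBkgTest=5000:"
       "SplitMode=Random:SplitSeed=123456");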

Booking classifiers
Different classifiers can run and be compared within the same TMVA job. Classifiers must be booked in advance, specifying their configuration in an option string:

    factory->BookMethod(TMVA::Types::kLikelihood, "LikelihoodD",
       "H:!TransformOutput:Spline=2:NSmooth=5:Preprocess=Decorrelate");

Specific options exist for each classifier.
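Several classifiers can be booked in the same job and compared, for example a Fisher discriminant, a neural network and a boosted decision tree; the option strings below are only indicative and depend on the TMVA version:

    factory->BookMethod(TMVA::Types::kFisher, "Fisher", "H:!V");
    factory->BookMethod(TMVA::Types::kMLP,    "MLP",    "H:!V:HiddenLayers=N+1");
    factory->BookMethod(TMVA::Types::kBDT,    "BDT",    "H:!V:NTrees=400:BoostType=AdaBoost");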

Train and test classifiers
All classifiers can be trained at once:

    factory->TrainAllMethods();

After training, the tests can be run and their results saved to the output file for visualization:

    factory->TestAllMethods();

Performance evaluation (efficiencies, etc.) can be done afterwards:

    factory->EvaluateAllMethods();
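At the end of the job the output file is typically closed and the factory deleted, so that the stored histograms and trees can then be inspected with the GUI macro described below (a minimal sketch):

    out.Close();      // "out" is the TFile passed to the Factory constructor
    delete factory;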

Apply your trained classifiers
Instantiate a TMVA reader:

    TMVA::Reader * reader = new TMVA::Reader();

Define the input variables (the same ones, in the same order, as used for the training!):

    Float_t a, b, c;
    reader->AddVariable("a", &a);
    reader->AddVariable("b", &b);
    reader->AddVariable("c", &c);

Book the classifiers, reading their output weight files:

    reader->BookMVA("<classifierName>", "weights.txt");

Evaluate a classifier for the current values of the variables:

    a = 1.234; b = 1.000; c = 10.00;
    Double_t r = reader->EvaluateMVA("<classifierName>");
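In a real application the reader is usually evaluated inside an event loop, with the local variables connected to the branches of the tree being analysed. A sketch follows; the tree, branch and weight-file names are placeholders (the actual weight-file name and format are produced by the training job):

    TFile * dataFile = TFile::Open("data.root");
    TTree * dataTree = (TTree*) dataFile->Get("Events");

    Float_t a, b, c;
    dataTree->SetBranchAddress("a", &a);
    dataTree->SetBranchAddress("b", &b);
    dataTree->SetBranchAddress("c", &c);

    TMVA::Reader * reader = new TMVA::Reader();
    reader->AddVariable("a", &a);
    reader->AddVariable("b", &b);
    reader->AddVariable("c", &c);
    reader->BookMVA("BDT", "weights.txt");   // weight file written by the training

    for (Long64_t i = 0; i < dataTree->GetEntries(); ++i) {
       dataTree->GetEntry(i);
       Double_t response = reader->EvaluateMVA("BDT");
       // use "response", e.g. apply a cut or fill a histogram
    }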

Classifier ranking in TMVA

TMVA GUI
The macro TMVAGui.C comes with the TMVA distribution. From the ROOT prompt:

    > .L TMVAGui.C
    > TMVAGui("myFile.root")

Then click on the desired plot option.

References
TMVA Users Guide, CERN-OPEN-2007-007, arXiv:physics/0703039
TMVA home page: http://tmva.sourceforge.net/