Cluster Classification Studies with MVA Techniques

Motivation:
The current EMFracClassification tool uses 75 TProfile2D plots as “lookup tables”.
Why not apply simple cuts, or maybe more sophisticated MVA discrimination techniques (Likelihood, ANN, ...)?
Why use only the two cluster moments <ϱ> and λ_clus?
Goal: improve the efficiency and purity of the classification.
This study is also used as a test analysis for developing a toolkit for multivariate analyses, TMVA (see http://tmva.sf.net).
The TMVA integration into ROOT is about to be finished this week.
Data set

The basis of the cluster classification studies are the post-Rome single pions with calibration hits:
http://menke.home.cern.ch/menke/cgi-bin/hec/postrome.sh
The same data sets were created with electrons/positrons, using the same software and scripts
(they would be on Castor already, but my grid certificate expired).
Number of events per generated single-particle energy:
Energy distribution for all clusters:
Which clusters are from the electron or pion?

In an empty calorimeter one expects up to 12 clusters from noise, in addition to the clusters from the generated single particle.
Take only clusters which contain energy from calibration hits (true G4).
Clusters in pion sample:
Clusters in electron sample:
Definition of the classification samples

Strategy: “Try to find the EM clusters first, and apply weights to the rest.”
EM clusters are the “signal”.
Definition of “EM clusters”: EM_frac (from calibration hits) > 0.9 (not tuned yet).
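A minimal sketch of how the per-cluster sample definition could look in code, combining the noise rejection from the previous slide with the EM_frac cut; the names engCalibTot and emFracCalib are placeholders for the per-cluster calibration-hit quantities, not actual branch names:

// Hedged sketch: 0 = noise cluster (dropped), 1 = "signal" (EM cluster),
// 2 = "background" (the rest, to be weighted later).
// engCalibTot / emFracCalib are placeholder names for the per-cluster
// total calibration-hit energy and calibration-hit EM fraction.
int classifyForTraining( double engCalibTot, double emFracCalib )
{
   if ( engCalibTot <= 0. ) return 0;   // no true G4 energy: pure noise cluster
   if ( emFracCalib > 0.9 ) return 1;   // EM_frac > 0.9: signal ("EM cluster")
   return 2;                            // everything else: background
}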
Cluster moments 1/3 (2.0 < |eta| < 2.2; 4 < E_clus < 16 GeV)
Cluster moments 2/3
Cluster moments 3/3

There are many cluster moments already calculated by default.
Some look pretty promising!
Try to find the “best variable” or “best variable set” using an automatic cut optimisation technique.
Method of Cut Optimisation

“Optimal cuts” maximise the signal efficiency at a given background efficiency.
The result is (in this case) a set of 100 cuts corresponding to signal efficiencies from 0 to 1.
Each cut set has a corresponding background rejection efficiency.
For the application afterwards one has to choose one working point.
Technically, the optimisation is achieved in TMVA by Monte Carlo generation, using uniform priors for the lower cut value and the cut width, thrown within the variable ranges.
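The following is a minimal single-variable sketch of this Monte Carlo procedure, only to illustrate the idea; it is not the actual TMVA implementation, and the function and variable names are made up for this example:

#include <TRandom3.h>
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Fraction of events of a sample falling inside the cut window [lo, lo + width).
double efficiency( const std::vector<double>& sample, double lo, double width )
{
   int pass = 0;
   for (std::size_t i = 0; i < sample.size(); ++i)
      if (sample[i] >= lo && sample[i] < lo + width) ++pass;
   return sample.empty() ? 0.0 : double(pass) / sample.size();
}

void mcCutOptimisation( const std::vector<double>& sig,   // variable values, signal
                        const std::vector<double>& bkg,   // variable values, background
                        double xmin, double xmax, int nTrials = 500000 )
{
   TRandom3 rnd(0);
   // best (smallest) background efficiency found in 100 bins of signal efficiency
   std::vector<double> bestBkgEff(100, 1.0), bestLo(100, 0.0), bestWidth(100, 0.0);

   for (int i = 0; i < nTrials; ++i) {
      // throw the lower cut value and the cut width with uniform priors
      // within the variable range
      double lo    = rnd.Uniform(xmin, xmax);
      double width = rnd.Uniform(0.0, xmax - lo);

      double effS = efficiency(sig, lo, width);
      double effB = efficiency(bkg, lo, width);

      int bin = std::min(int(effS * 100), 99);   // signal-efficiency bin of this cut
      if (effB < bestBkgEff[bin]) {              // keep the cut with the best rejection
         bestBkgEff[bin] = effB;
         bestLo[bin]     = lo;
         bestWidth[bin]  = width;
      }
   }
   // one optimal cut per signal-efficiency bin; a working point is chosen later, e.g.
   printf("at eff_S ~ 0.90: eff_B = %.3f, window [%.3g, %.3g]\n",
          bestBkgEff[90], bestLo[90], bestLo[90] + bestWidth[90]);
}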
Example for Cut Optimisation

Take the two variables from the EMFracTool: <ϱ> and λ_clus.
Run the cut optimisation:
The EMFracTool would be just one point in this plot.
Finding the best set of variables

Strategy: run the cut optimisation for all combinations of 2 (3, 4, 5) moments out of the 16.
Compare the resulting signal efficiencies at a background rejection of 99% (high purity).
This is done for more than 1000 combinations in two bins:
0.2 < |eta| < 0.4, 4 < E_clus < 16 GeV
2.0 < |eta| < 2.2, 4 < E_clus < 16 GeV
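A rough sketch of how such a combination scan can be driven; the subset enumeration is standard, while runCutOptimisation() is only a placeholder for the per-set training shown on the code-example slides:

#include <TString.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// Loop over all n-choose-k subsets of the cluster-moment names in allVars
// (k <= n assumed) and hand each subset to the (placeholder) cut optimisation.
void scanCombinations( const std::vector<TString>& allVars, std::size_t k )
{
   const std::size_t n = allVars.size();         // here: the 16 cluster moments
   std::vector<bool> mask(n, false);
   std::fill(mask.begin(), mask.begin() + k, true);

   do {                                           // each mask = one combination
      std::vector<TString> inputVars;
      for (std::size_t i = 0; i < n; ++i)
         if (mask[i]) inputVars.push_back(allVars[i]);

      // placeholder: book "MethodCuts" with these input variables, train, and
      // record the signal efficiency at 99% background rejection
      // runCutOptimisation( inputVars );
   } while (std::prev_permutation(mask.begin(), mask.end()));
}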
“optimal” Set of Variables (i)

0.2 < |eta| < 0.4, 4 < E_clus < 16 GeV:

--- MVA                Signal efficiency:
--- Methods:       @B=0.01  @B=0.10  @B=0.30
--- Cuts_278 :      0.681    0.940    0.986
--- Cuts_279 :      0.671    0.939    0.987
--- Cuts_27  :      0.671    0.939    0.986
--- Cuts_27c :      0.671    0.938    0.986
--- Cuts_8c  :      0.668    0.915    0.985
--- Cuts_289 :      0.667    0.936    0.987
--- Cuts_28a :      0.666    0.936    0.986
--- Cuts_27a :      0.663    0.938    0.986
--- Cuts_27b :      0.661    0.939    0.986
--- Cuts_270 :      0.654    0.941    0.987
--- Cuts_8a  :      0.651    0.936    0.985
--- Cuts_28c :      0.644    0.935    0.986
--- Cuts_280 :      0.644    0.929    0.986

The name “Cuts_xyz” is a shortcut for cutting on the variable set (x, y, z):
0 = "cl_m2_r_topo"
1 = "cl_m2_lambda_topo"
2 = "cl_center_lambda_topo"
3 = "cl_lateral_topo"
4 = "cl_center_x_topo"
5 = "cl_longitudinal_topo"
6 = "cl_lateral_topo"
7 = "cl_m1_dens_topo"
8 = "cl_m2_dens_topo"
9 = "cl_center_Y_topo"
a = "cl_delta_theta_topo"
b = "cl_center_z_topo"
c = "cl_eng_frac_max_topo"
“optimal” Set of Variables (ii)

The name “Cuts_xyz” is a shortcut for cutting on the variable set (x, y, z):
0 = "cl_m2_r_topo"
1 = "cl_m2_lambda_topo"
2 = "cl_center_lambda_topo"
3 = "cl_lateral_topo"
4 = "cl_center_x_topo"
5 = "cl_longitudinal_topo"
6 = "cl_lateral_topo"
7 = "cl_m1_dens_topo"
8 = "cl_m2_dens_topo"
9 = "cl_center_Y_topo"
a = "cl_delta_theta_topo"
b = "cl_center_z_topo"
c = "cl_eng_frac_max_topo"

2.0 < |eta| < 2.2, 4 < E_clus < 16 GeV:

--- MVA                Signal efficiency:
--- Methods:       @B=0.01  @B=0.10  @B=0.30
--- Cuts_25c :      0.568    0.891    0.980
--- Cuts_258 :      0.566    0.892    0.979
--- Cuts_25b :      0.556    0.890    0.980
--- Cuts_256 :      0.554    0.891    0.980
--- Cuts_5bc :      0.553    0.893    0.980
--- Cuts_25  :      0.539    0.896    0.979
--- Cuts_257 :      0.539    0.894    0.980
--- Cuts_25a :      0.539    0.894    0.980
--- Cuts_25c :      0.538    0.892    0.980
--- Cuts_5b  :      0.534    0.896    0.980
--- Cuts_7b  :      0.533    0.930    0.985
--- Cuts_278 :      0.533    0.930    0.985
--- Cuts_27  :      0.533    0.929    0.985
--- Cuts_279 :      0.533    0.928    0.984
“optimal” Set of Variables (iii)

The optimal set of variables seems to be eta (energy?) dependent.
The most prominent variables are: center_lambda, m2_dens, longitudinal, frac_em.
Needs further investigation.
Let TMVA use these four variables and try some other discrimination techniques:

--- TMVA_Factory: Evaluation results ranked by best 'signal eff @B=0.01'
---------------------------------------------------------------------------
--- MVA                Signal efficiency:         Signifi-  Sepa-    mu-Trans-
--- Methods:       @B=0.01  @B=0.10  @B=0.30      cance:    ration:  form:
--- TMlpANN    :    0.604    0.934    0.988       2.331     0.770    0.841
--- Cuts       :    0.554    0.924    0.983       0.000     0.000    0.000
--- Likelihood :    0.472    0.893    0.990       1.670     0.693    0.938
--- BDTGini    :    0.393    0.914    0.981       2.115     0.719    0.898
--- PDERS      :    0.345    0.858    0.976       1.998     0.685    0.780
--- Fisher     :    0.194    0.790    0.981       1.355     0.538    0.798

--- TMVA_MethodFisher: ranked output (top variable is best ranked)
----------------------------------------------------------------
--- Variable:                  Coefficient:   Discr. power:
--- cl_m1_dens_topo :            +2.877         0.4517
--- cl_center_lambda_topo :      -2.796         0.3710
--- cl_eng_frac_em_topo :        -0.039         0.3436
--- cl_longitudinal_topo :       -1.206         0.2722
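A rough sketch of how the additional classifiers could be booked with the same factory interface used on the training slides; the method-name strings other than "MethodCuts" and the evaluation call simply follow the "Method..." / "...AllMethods" naming pattern seen there and should be read as assumptions, and the option strings are placeholders, not the configuration used for the table above:

// book several discrimination methods on the same four input variables
tmva_factory->BookMethod( "MethodCuts",       "V:MC:500000:AllFSmart" );
tmva_factory->BookMethod( "MethodLikelihood", "<likelihood options>" );   // assumed name
tmva_factory->BookMethod( "MethodFisher",     "<fisher options>" );       // assumed name
tmva_factory->BookMethod( "MethodTMlpANN",    "<ann options>" );          // assumed name

tmva_factory->TrainAllMethods();     // train every booked method
tmva_factory->TestAllMethods();      // run them on the test sample
tmva_factory->EvaluateAllMethods();  // assumed step producing the ranked table above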
Summary & Outlook

The rectangular cut method is a really competitive method for cluster classification.
Optimal cuts are calculated for each signal efficiency / background rejection -> one still needs to choose a working point.
Optimal sets of cuts for all bins of E and eta are currently being calculated.
-> Then decide which variables to use in the end.
For the application, use the TMVA_Reader (ROOT class).
The other methods are not yet fully tuned:
since at least one variable is perfectly discriminating, one has to remove this variable and do a training on the remaining variables on top of it.
Code Example: Do the training in 72 bins!

// load data sets
// (tmva_factory and inputVars are created beforehand)
TString datFileS = "data/e.dat";
TString datFileB = "data/pi.dat";
tmva_factory->SetInputTrees( datFileS, datFileB );

// which variables are used for discrimination
inputVars->push_back( "cl_m2_r_topo" );
inputVars->push_back( "cl_m2_lambda_topo" );
inputVars->push_back( "cl_delta_phi_topo" );
tmva_factory->SetInputVariables( inputVars );
// split data set and do training for EACH bin!
Double_t etaBins[25]   = { 0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0,
                           2.2, 2.4, 2.6, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 4.2,
                           4.4, 4.6, 4.8 };
Double_t energyBins[6] = { 0.0000, 4000, 16000, 64000, 200000, 40000000 };
tmva_factory->BookMultipleMVAs( "cl_e_topo",       5, &energyBins[0] );
tmva_factory->BookMultipleMVAs( "cl_m1_eta_topo", 24, &etaBins[0] );

// choose method
tmva_factory->BookMethod( "MethodCuts", "V:MC:500000:AllFSmart" );

tmva_factory->TrainAllMethods();
tmva_factory->TestAllMethods();
Code Example: Apply Classification in Athena

// create TMVA_Reader object
TMVA_Reader *tmva = new TMVA_Reader( inputVars );
tmva->BookMultipleMVAs( "cl_e_topo",       5, &energyBins[0] );
tmva->BookMultipleMVAs( "cl_m1_eta_topo", 24, &etaBins[0] );
tmva->BookMVA( TMVA_Reader::Likelihood, "myweightfile" );

double mvaLKD = tmva->EvaluateMVA( varValues, multicutValues, TMVA_Reader::LikelihoodD );
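A short per-cluster usage sketch under stated assumptions: the container types of varValues and multicutValues, their filling order, and the 0.5 threshold are placeholders, not the tool's actual interface or working point:

// Hedged usage sketch (names, types, and threshold are assumptions):
// varValues holds the discriminating cluster moments in the same order as inputVars;
// multicutValues holds the bin variables (cl_e_topo, cl_m1_eta_topo) of this cluster,
// so that the reader picks the MVA trained for the matching E/eta bin.
std::vector<double> varValues( inputVars->size() );
std::vector<double> multicutValues( 2 );
// ... fill varValues and multicutValues from the current cluster ...
double mvaLKD = tmva->EvaluateMVA( varValues, multicutValues, TMVA_Reader::LikelihoodD );
bool isEMCluster = ( mvaLKD > 0.5 );   // placeholder working point, to be chosen per bin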