1
Optimal Variable Selection for Machine Learning Analysis in ATLAS ttH Search
DESY Summer Student Programme 2017
Sitong An
Supervisors: Judith Katzy, Paul Glaysher
DESY ATLAS Internal Presentation 05/09/2017
2
Introduction
ttH analysis: SM background with complex final states
Difficult to establish cuts on a single variable
Signal: ttH | Background: tt+jets. Same final-state particles.
Selection: 6 jets with >=3 (Very tight). Pt_jets > 25 GeV
| ttH BDT Variable Selection | Sitong An 04/09/2017 | ATLAS Internal
4
Introduction
Machine learning techniques: Boosted Decision Trees (BDT)
“Boosted”: each new tree is trained with more emphasis on previously misclassified training data
The weighted sum of the individual trees' votes gives the final BDT score
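The boosting idea above can be illustrated with a minimal AdaBoost sketch. It is not the TMVA implementation: it assumes one-dimensional inputs, single-cut "stumps" in place of full decision trees, and a score normalised by the sum of tree weights. All names and the toy data are hypothetical.

```python
import math

def predict(stump, x):
    """A stump votes `sign` above its threshold and `-sign` below it."""
    thr, sign = stump
    return sign if x >= thr else -sign

def train_stump(xs, ys, ws):
    """Return (weighted error, stump) for the single cut with the lowest weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (+1, -1):
            err = sum(w for x, y, w in zip(xs, ys, ws) if predict((thr, sign), x) != y)
            if best is None or err < best[0]:
                best = (err, (thr, sign))
    return best

def adaboost(xs, ys, n_trees):
    ws = [1.0 / len(xs)] * len(xs)          # start with equal event weights
    ensemble = []
    for _ in range(n_trees):
        err, stump = train_stump(xs, ys, ws)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # weight of this tree's vote
        ensemble.append((alpha, stump))
        # Boosting step: misclassified events gain weight, correct ones lose weight
        ws = [w * math.exp(-alpha * y * predict(stump, x)) for x, y, w in zip(xs, ys, ws)]
        total = sum(ws)
        ws = [w / total for w in ws]
    return ensemble

def bdt_score(ensemble, x):
    """Weighted sum of the individual votes, normalised to lie in [-1, 1]."""
    return sum(a * predict(s, x) for a, s in ensemble) / sum(a for a, _ in ensemble)

# Toy data: +1 = signal, -1 = background; no single cut separates them
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [-1, -1, -1, +1, -1, +1, +1, +1]
one_tree = adaboost(xs, ys, 1)
three_trees = adaboost(xs, ys, 3)
```

In this toy case a single stump misclassifies one event, while the weighted vote of three boosted stumps classifies all eight correctly.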
5
Motivation
More variables -> more information -> better separating power
Goal: reduce the number of variables used with the least impact on separating power, and produce a ranking of variables
i.e. how do we select the best N variables to use in the training?
Why?
We use Monte Carlo data to establish discrimination
Need to check the validity and potential bias of each variable
Fewer variables -> less effort on checks and reduced systematics
Remove information redundancy
Reduce training time
Potentially provide insight into the physical importance of the few selected variables
6
Setup
This investigation:
Data: Monte Carlo simulated
Default variable list from ICHEP paper (Summer 2016)
Framework: Toolkit for Multivariate Data Analysis with ROOT (TMVA)
Hyperparameters:
Nvariables = 21
Ntrees = 400
MinNodeSize = 4%
MaxDepth = 5
nCuts = 80
BoostType = AdaBoost
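TMVA methods are configured through a colon-separated option string. As a hedged sketch, the hyperparameters above could be assembled as follows; the helper name is hypothetical, the option is spelled `NTrees` as in the TMVA documentation (the slide writes Ntrees), and the actual booking call in ROOT is deliberately not shown.

```python
# Hyperparameters quoted on the slide (Nvariables is a property of the input list,
# not a BDT option, so it is left out here)
hyperparameters = {
    "NTrees": 400,
    "MinNodeSize": "4%",
    "MaxDepth": 5,
    "nCuts": 80,
    "BoostType": "AdaBoost",
}

def tmva_option_string(options):
    """Join key=value pairs with ':' in the style of a TMVA option string."""
    return ":".join(f"{key}={value}" for key, value in options.items())

bdt_options = tmva_option_string(hyperparameters)
print(bdt_options)
```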
7
Setup Initial Variable List - ICHEP Summer 2016
8
Performance Benchmark
How do we evaluate the performance of each subset of variables for the BDT?
Receiver Operating Characteristic (ROC) curve: background rejection versus signal efficiency
Use the area under the curve as an indication of performance
This is analysis-specific: other performance benchmarks are possible
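A minimal sketch of the ROC integral, assuming unweighted events and a simple scan over every observed score as a cut value. TMVA computes this from binned, weighted histograms, so the function names and toy scores here are purely illustrative.

```python
def roc_points(sig_scores, bkg_scores):
    """(signal efficiency, background rejection) for cuts 'score > threshold'."""
    points = [(1.0, 0.0)]  # cut below every score: all signal kept, no background rejected
    for thr in sorted(set(sig_scores) | set(bkg_scores)):
        sig_eff = sum(s > thr for s in sig_scores) / len(sig_scores)
        bkg_rej = sum(b <= thr for b in bkg_scores) / len(bkg_scores)
        points.append((sig_eff, bkg_rej))
    points.reverse()       # order by increasing signal efficiency along the curve
    return points

def roc_integral(points):
    """Trapezoidal area under the rejection-versus-efficiency curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical BDT scores for a handful of signal and background events
sig = [0.4, 0.6, 0.8, 0.9]
bkg = [0.1, 0.3, 0.5, 0.7]
auc = roc_integral(roc_points(sig, bkg))
```

An integral of 1.0 corresponds to perfect separation and 0.5 to no separation; the toy scores above sit in between.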
9
Idea 1: Visual inspection of S/B separation
Intuitively, variables with less signal/background overlap should have more separating power.
Signal and background events are normalised to similar magnitude in these plots.
10
Idea 1: Visual inspection of S/B separation
We selected these 8 variables out of 21, based on their apparent separating power:
dEtajj_MaxdEta
NHiggs_30
Njet_pt40
Centrality_all
semilepMVAreco_bbhiggs_dR
dRbb_avg
dRbb_MaxPt
pT_jet5
12
Idea 1: Visual inspection of S/B separation
Complementary set of 13 apparently “less useful” variables (plots)
13
Idea 1: Visual inspection of S/B separation
Results (ROC Integral):
With all variables:
With 8 ‘useful’ variables:
With the other 13 ‘useless’ variables:
It seems that the number of variables matters more than which variables we use. Non-intuitive.
More importantly: we selected these 21 variables based on their apparent separating power, which does not seem to be a good method for variable selection in the first place.
What if there are unused variables that are better for training the BDT for ttH analysis, but just did not look good to us?
14
Idea 2: Remove variable from pairs with strong correlation
Stronger correlation within a pair -> more redundancy in information
Remove one variable from the pair with maximal correlation
Intuitively, this gives the least reduction in information content
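One step of this idea can be sketched as below, assuming linear (Pearson) correlation and an arbitrary choice of which member of the pair to drop; the slides do not say how that choice was made. The column names and values are toy placeholders.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Linear correlation coefficient between two equally long lists of values."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def drop_most_correlated(columns):
    """Find the pair with maximal |correlation| and drop its second member (assumption)."""
    pair = max(combinations(columns, 2),
               key=lambda p: abs(pearson(columns[p[0]], columns[p[1]])))
    dropped = pair[1]
    remaining = {name: values for name, values in columns.items() if name != dropped}
    return remaining, dropped

# Toy inputs: var_b is almost a rescaled copy of var_a
columns = {
    "var_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "var_b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "var_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
remaining, dropped = drop_most_correlated(columns)
```

Calling `drop_most_correlated` repeatedly on its own output gives a subtractive iteration over the variable list.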
15
Idea 2: Remove variable from pairs with strong correlation
Correlation Matrix
16
Idea 2: Remove variable from pairs with strong correlation
Results: Correlation-based Subtractive Iteration
Start with all 21 variables
Iteratively reduce the number of variables
Reference baseline: random variable sets
17
Idea 3: TMVA out-of-the-box ranking
Widely known to be unreliable (TMVA author’s opinion)
“A ranking of the BDT input variables is derived by counting how often the variables are used to split decision tree nodes, and by weighting each split occurrence by the separation gain-squared it has achieved and by the number of events in the node.”
Ranking produced after one training run with all 21 variables
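The quoted definition can be read as a sum over splits of gain squared times node population. The sketch below follows that reading; the split records are hypothetical toy values, and normalising the importances to sum to one is an assumption rather than something stated on the slide.

```python
from collections import defaultdict

def variable_importance(splits):
    """Rank variables by sum over their splits of (separation gain)^2 * (events in node)."""
    raw = defaultdict(float)
    for variable, gain, n_events in splits:
        raw[variable] += gain ** 2 * n_events
    total = sum(raw.values())
    # Normalise so the importances sum to 1 (assumption), highest first
    return sorted(((var, value / total) for var, value in raw.items()),
                  key=lambda item: item[1], reverse=True)

# Hypothetical split records: (variable used, separation gain, events in the node)
splits = [
    ("dRbb_avg", 0.2, 1000),
    ("HT_jets", 0.1, 1000),
    ("dRbb_avg", 0.1, 500),
]
ranking = variable_importance(splits)
```

Because each split is scored on its own, nothing in this sum accounts for correlation between variables.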
18
Idea 3: TMVA out-of-the-box ranking
Results: TMVA Ranking
Take the top N variables according to the TMVA ranking
Train and test the performance
19
Idea 3: TMVA out-of-the-box ranking
TMVA Rerank: subtractive iteration
Update the TMVA ranking after removing each variable
Random Tweaks
Take the TMVA-ranking variable list with 6 variables
Randomly take one out and randomly put one in
20
Idea 3: TMVA out-of-the-box ranking
Random tweaks on the TMVA ranking selection show that it is definitely not the optimal selection
We don’t think it takes correlation/mutual information into account
Attempts to improve TMVA Ranking:
Instead of using the ranking produced when training with all 21 variables, update the ranking at each number of variables. This should reflect the correlation relationships.
…But it didn’t.
Conclusion: TMVA Ranking is a good reference/starting point, but not optimal and not stable.
21
Idea 4: Iterative Removal
At each step, test the removal of each remaining variable, remove the one with the least impact on performance, and repeat
22
Idea 4: Iterative Removal
Results: Iterative Removal (Subtractive Iteration)
Start with 21 variables
Test all 21 possibilities of removing one variable (21 variable lists with Nvar=20)
Select the variable list with the best performance (the variable removed has the least impact)
Repeat
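The procedure above can be sketched as a greedy backward elimination. Here `score` is a hypothetical stand-in for "train a BDT on this variable list and return its ROC integral"; the toy version simply adds up made-up per-variable contributions so the example runs instantly.

```python
def iterative_removal(variables, score, n_keep):
    """Repeatedly drop the variable whose removal leaves the best-scoring list."""
    current = list(variables)
    history = [(tuple(current), score(current))]
    while len(current) > n_keep:
        # One candidate list per variable that could be removed at this step
        candidates = [[v for v in current if v != removed] for removed in current]
        current = max(candidates, key=score)
        history.append((tuple(current), score(current)))
    return current, history

# Toy stand-in for BDT performance (hypothetical numbers, no interactions between variables)
toy_power = {"a": 0.30, "b": 0.05, "c": 0.20, "d": 0.01}

def toy_score(variable_list):
    return sum(toy_power[v] for v in variable_list)

kept, history = iterative_removal(["a", "b", "c", "d"], toy_score, n_keep=2)
```

Each step needs one training per remaining variable, which is where the computing cost comes from.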
23
Idea 4: Iterative Removal
Results: Iterative Removal
Not optimal
24
Idea 4: Iterative Removal
A hill-climbing search in a 21-dimensional space, with each dimension being binary (0 or 1)
Most obvious idea, but resource hungry (computing time)
Easily stuck in local maxima, so not optimal
26
Idea 5: Beam Search (with width N)
What if, instead of taking the best-performing variable list each time, we take the top N?
27
Idea 5: Beam Search (with width N)
Results: Beam Search (w=10)
Found a better solution here than Iterative Removal
A better solution is found with greater width, but it is more computationally expensive
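A minimal beam-search sketch, assuming the same subtractive scheme as Iterative Removal but keeping the top `width` variable lists at every step. The score table is a made-up toy chosen so that the greedy path (width 1) misses the best pair; it does not come from the analysis.

```python
def beam_search(variables, score, n_keep, width):
    """Subtractive search that keeps the `width` best subsets at each size."""
    beam = [frozenset(variables)]
    for _ in range(len(variables) - n_keep):
        # Every subset reachable by removing one variable from any list in the beam
        candidates = {subset - {v} for subset in beam for v in subset}
        beam = sorted(candidates, key=score, reverse=True)[:width]
    return max(beam, key=score)

# Hypothetical performance table for subsets of four toy variables
toy_scores = {frozenset(key): value for key, value in {
    "abcd": 0.72,
    "abc": 0.70, "abd": 0.69, "acd": 0.60, "bcd": 0.58,
    "ab": 0.60, "ac": 0.62, "bc": 0.61, "ad": 0.55, "bd": 0.68, "cd": 0.50,
}.items()}

def toy_score(subset):
    return toy_scores[frozenset(subset)]

greedy_choice = beam_search("abcd", toy_score, n_keep=2, width=1)
beam_choice = beam_search("abcd", toy_score, n_keep=2, width=2)
```

With width 1 the search reduces to Iterative Removal and ends on {a, c}; with width 2 it also follows the second-best three-variable list and reaches the better pair {b, d}, at the price of scoring more candidates per step.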
28
Idea 6: Random Walk
Introducing more randomness into the variable selection process could potentially kick the search out of local maxima
Could make use of TMVA Ranking as a starting point (utilize easily obtainable prior information)
29
Idea 6: Random Walk
Results: Random Walk
Black: TMVA Ranking as comparison
Red: envelope of Random Walk algorithm performance
Start from TMVA Ranking
Achieved comparable performance to Iterative Removal after 9 iterations
30
Idea 6: Random Walk
Results
Random Walk algorithm asymptotically approaches the optimal selection
Unlike previous methods, computation cost does not scale with the total number of variables in the pool
Good check on previous results obtained
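The slides do not spell out the random-walk update rule, so the following is only one plausible sketch: at each step one randomly chosen variable in the list is swapped for a random variable from outside it, and the best list seen so far is recorded (the "envelope"). The `score` function, the pool and the starting list are hypothetical placeholders.

```python
import random

def random_walk(start, pool, score, n_steps, seed=0):
    """Randomly swap one variable per step and remember the best list visited."""
    rng = random.Random(seed)
    current = list(start)
    best, best_score = list(current), score(current)
    for _ in range(n_steps):
        outside = [v for v in pool if v not in current]
        if not outside:
            break
        # Take one variable out at random and put a random unused one in
        current[rng.randrange(len(current))] = rng.choice(outside)
        step_score = score(current)          # one training per step, whatever the pool size
        if step_score > best_score:
            best, best_score = list(current), step_score
    return best, best_score

# Toy stand-in for BDT performance (made-up contributions)
toy_power = {"a": 0.30, "b": 0.05, "c": 0.20, "d": 0.01,
             "e": 0.15, "f": 0.02, "g": 0.03, "h": 0.04}

def toy_score(variable_list):
    return sum(toy_power[v] for v in variable_list)

start = ["f", "g", "h"]                      # e.g. a list taken from a prior ranking
best, best_score = random_walk(start, list(toy_power), toy_score, n_steps=50)
```

The recorded best score can only stay level or rise as the walk continues, which is the behaviour an envelope curve displays.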
31
Best Variables
Top ten variables as selected by different methods from the ICHEP Summer 2016 variable list (unranked)
Columns: Beam Search w=5 | Iterative Removal | Remove-one Var Static | TMVA Ranking (Static) | TMVA Reranking
Variables listed:
H1_all
HT_jets
semilepMVAreco_BDT_withH_output
Centrality_all
NHiggs_30
dRbb_avg
Mbb_MindR
Aplan_jets
Mbj_MaxPt
pT_jet5
Njet_pt40
dRbb_MaxPt
dEtajj_MaxdEta
Mjj_MindR
semilepMVAreco_higgsbhadtop_withH_dR
32
In need of more colours
All test runs on the ICHEP Summer 2016 variable lists
33
Validation: Alternative Sets of Variables
New Variable List (21 Variables)
semilepMVAreco_Ncombinations
HT_all
dRbj_Wmass
Mbj_Wmass
Mbj_MindR
dRHl_MaxdR
dRlj_MindR
pT_jet3
dRbb_MaxM
dRjj_min
H4_all
Aplan_bjets
Mjjj_MaxPt
Mbb_MaxM
Mjj_MinM
semilepMVAreco_higgslep_dR
semilepMVAreco_leptophadtop_with_dR
semilepMVAreco_b1higgsbhadtop_dR
semilepMVAreco_ttH_Ht_with
semilepMVAreco_BDT_output
semilepMVAreco_higgsbleptop_with_dR
34
Validation: Alternative Sets of Variables
Results
35
vSearch: Parallelised Python Script for Variable Selection
Iterative Removal of 21 variables ~ 400 jobs. Error-prone.
“Launch and forget” script for variable selection
Automatic training and testing of alternative variable lists
Automatic selection of the best variable lists with predefined search strategies
Human-readable output. Analysis and plotting script included.
Designed to run in parallel on the BIRD cluster to save running time
Designed to be highly modular and as generic as possible:
Different search strategies (Iterative Removal / Beam Search / Random Walk)
Different performance benchmarks (Don’t like ROC Integral?)
Different physical processes (successfully applied to SUSY search | ref: Emily’s earlier talk)
Working on proper documentation and interface
36
Selection on Combined Variable Lists (42 Variables)
Results
37
Final Recommended Selection for ttH Search
Combined Variable Lists (42 Variables)
AUROC with all 42 variables: 75.0
AUROC with 10 top variables: 74.2
Performance drop of about 1%
Most of the variables contain redundant information!
Top 10 Recommended Variables:
dRbb_avg
semilepMVAreco_BDT_withH_output
HT_jets
NHiggs_30
Centrality_all
H1_all
Mbb_MindR
Aplan_jets
semilepMVAreco_Ncombinations
semilepMVAreco_ttH_Ht_withH
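As a quick check of the quoted drop, the arithmetic can be written out as below. It uses only the two AUROC values from the slide and assumes the 1% refers to the change relative to the 42-variable result.

```python
auroc_all = 75.0      # all 42 variables
auroc_top10 = 74.2    # 10 recommended variables

absolute_drop = auroc_all - auroc_top10
relative_drop_percent = 100.0 * absolute_drop / auroc_all

print(f"absolute drop: {absolute_drop:.1f} points")
print(f"relative drop: {relative_drop_percent:.2f}%")
```

This gives a drop of 0.8 points, or about 1.07% relative to the full variable list.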
38
Conclusion
Variable selection is a non-trivial problem in the machine learning community
We have even less experience of this in HEP
Important for physics searches with machine learning
TMVA ranking is a good guide but sub-optimal
Iterative Removal is good enough for most purposes
Use Beam Search/Random Walk to push the limit of performance (and your computing grid)
Use vSearch to automate the search process
39
Questions?
DESY ATLAS Internal Presentation 05/09/2017
40
Backup
42
Intro ttH analysis Complex Background Difficult to establish cuts
Machine Learning techniques Boosted Decision Trees (BDT)