Slide 1: Using Random Forests to Classify W⁺W⁻ and tt̄ Events
J. Lovelace Rainbolt, Thoth Gunter, Michael Schmitt
CIERA Pizza Discussion, 20 October 2014

Slide 2: Today I will tell you about a particle physics problem. Why? Because we are using a technique used by some people in astronomy (and cosmology). Our results pertain to a test case, not an actual result. This is a step toward grander things (including some astrophysics)…

Slide 3: (figure only; no transcribed text)

Slide 4: Current measurements of the W⁺W⁻ cross section are mildly discrepant. In principle this could be a sign of new physics. Also: the W⁺W⁻ process is the main, irreducible background for an important Higgs decay mode.

Slide 5: (image credit: blog.yhathq.com)

Slide 6: (figure: a toy decision tree splitting on color? / shape? / size?, with branches red/green, pointy/round, tall/medium/small, and leaves such as red fir, red maple, pine tree, birch, bush)
There is nothing particularly good about this particular tree. It may have good, fair, or poor performance. One can easily come up with other trees with more or less equivalent performance. A computer algorithm can come up with these trees.
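
As a purely illustrative example of such a tree (made-up color/shape/size data, nothing taken from the talk), a minimal scikit-learn sketch:

```python
# Illustrative sketch only (not the authors' code): a toy decision tree like the
# one pictured on the slide, built on hypothetical color/shape/size data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoding: color (0=red, 1=green), shape (0=pointy, 1=round),
# size (0=small, 1=medium, 2=tall).
X = np.array([[0, 0, 2], [0, 1, 2], [1, 0, 2], [1, 1, 1], [1, 1, 0]])
y = np.array(["red fir", "red maple", "pine tree", "birch", "bush"])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=["color", "shape", "size"]))
```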

Slide 7: (figure; source: Andrew Zisserman, Oxford)

Slide 8: (figure; source: Andrew Zisserman, Oxford)

Slide 9: (figure; source: Andrew Zisserman, Oxford)

Slide 10: A random forest is a collection of binary decision trees.
- The input data are randomized tree by tree ("bagging").
- The features presented to a given tree are a random subset of the full feature set.
- The arrangement of nodes is random.
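
As a minimal illustration of these ingredients (hypothetical data and parameter choices, not the configuration used in this analysis), a random forest in scikit-learn might be set up like this:

```python
# Minimal random-forest sketch on fake data (assumed setup, not the talk's code).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))                     # hypothetical event features
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)      # hypothetical signal/background labels

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    bootstrap=True,        # "bagging": each tree sees a bootstrap resample of the events
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
).fit(X, y)
```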

Slide 11: Take a suitable average of the results from all the trees. (figure; source: Andrew Zisserman, Oxford)
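
The averaging can be made explicit. In this toy sketch (the same kind of hypothetical data as above, not the analysis itself), the forest's score is reproduced by averaging the individual trees' outputs by hand:

```python
# The forest score is the mean of the per-tree scores (illustrative toy data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

score = forest.predict_proba(X)[:, 1]              # averaged output Y
by_hand = np.mean([t.predict_proba(X)[:, 1] for t in forest.estimators_], axis=0)
assert np.allclose(score, by_hand)
```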

Slide 12: (figure only; no transcribed text)

Slide 13: (figure: distribution of the random-forest output Y, shown separately for the signal, the main background, and a minor background)
Declare a minimum value of Y to select a sample of events. We can select whatever cut suits our purpose.
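
In code, the selection is simply a threshold on the forest output; a toy sketch (hypothetical data, and the cut value 0.8 is an arbitrary illustration):

```python
# Selecting events with a cut on the forest output Y (illustrative toy data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)      # 1 = signal, 0 = background
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

Y = forest.predict_proba(X)[:, 1]                  # per-event forest output
y_cut = 0.8                                        # hypothetical working point
passed = Y > y_cut
print(f"selected {passed.sum()} of {len(X)} events; "
      f"signal efficiency = {passed[y == 1].mean():.2f}")
```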

Slide 14: The Standard Cuts. This is a relatively simple analysis. Reference: CMS Collab., Phys. Lett. B (2013).

Slide 15: ROC (receiver operating characteristic) curve. (figure: true positives versus false positives, with an arrow marking the direction of better performance)
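
Purely as an illustration of how such a curve is computed (hypothetical labels and scores, not the talk's events), a short scikit-learn sketch:

```python
# ROC curve sketch on fake scores (illustrative only).
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                   # hypothetical truth labels
score = y_true + rng.normal(scale=0.8, size=1000)        # hypothetical classifier output

fpr, tpr, thresholds = roc_curve(y_true, score)          # one (fpr, tpr) point per cut
print("area under the ROC curve:", auc(fpr, tpr))
```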

Slide 16: The cross section is what we want to measure. Figure of merit: the relative statistical uncertainty on the cross section. Ingredients: N, the number of selected WW and tt̄ events (depends on the cut on Y); the acceptance and efficiency, where ε_WW and ε_tt̄ depend on the Y cut; and the luminosity (independent of the cut on Y).
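
The formula itself did not survive transcription. As a hedged reconstruction (an assumption on my part, not taken verbatim from the slide), the ingredients listed above most likely enter through the standard counting-experiment form, with N_sel the number of selected events, N_tt̄ the estimated top-quark background, A·ε the acceptance times efficiency, and L the integrated luminosity:

```latex
% Assumed standard form (not verbatim from the slide):
\sigma_{WW} = \frac{N_{\mathrm{sel}} - N_{t\bar t}}{A_{WW}\,\varepsilon_{WW}\,\mathcal{L}},
\qquad
\left.\frac{\delta\sigma_{WW}}{\sigma_{WW}}\right|_{\mathrm{stat}}
  \approx \frac{\sqrt{N_{\mathrm{sel}}}}{N_{\mathrm{sel}} - N_{t\bar t}} .
```

In this form only the event counts and A·ε change as the Y cut is varied, so the figure of merit can be optimized by scanning the cut.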

Slide 17: (figure: performance as a function of the Y cut) A factor of two better!

Slide 18: Recall that the current values for σ_WW are "too high." Jet rates are difficult to calculate precisely… is this causing trouble?

Slide 19: (figure only; no transcribed text)

Slide 20: It would be great to see the N_jet distribution for WW only. Surprisingly, we can, without incurring a major loss in statistical power. The RF certainly makes use of the N_jet feature, but it is much gentler on the distribution. (Since at least some trees do not use N_jet, at least some events with N_jet > 0 will be selected.) In this particular case, we could do a good measurement while hardly changing the N_jet distribution at all!
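
The parenthetical statement can be checked directly on a fitted forest. The sketch below uses hypothetical features and data (names like "n_jet" and all settings are illustrative assumptions, not the analysis configuration) and simply counts how many trees ever split on the jet-multiplicity column:

```python
# Illustrative check (hypothetical data, not the analysis code): count how many
# trees in a forest ever split on the jet-multiplicity feature. Trees that never
# use n_jet can still select events with n_jet > 0, which is why the forest is
# gentler on the N_jet distribution than a hard jet veto would be.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
features = ["met", "pt_ll", "m_ll", "n_jet"]            # hypothetical feature names
X = rng.normal(size=(2000, len(features)))
X[:, 3] = rng.poisson(1.0, size=2000)                   # fake jet multiplicity
y = (X[:, 0] + 0.3 * X[:, 3] > 0.5).astype(int)         # fake signal/background labels

forest = RandomForestClassifier(n_estimators=200, max_depth=3,
                                max_features=1,         # exaggerated randomization for illustration
                                random_state=0).fit(X, y)

njet = features.index("n_jet")
n_using = sum((t.tree_.feature == njet).any() for t in forest.estimators_)
print(f"{n_using} of {len(forest.estimators_)} trees split on n_jet")
```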

Slide 21: (figure: distributions before and after the selection) tt̄ is greatly reduced, but WW is hardly changed.

Slide 22: Other tests of the robustness of the RF (a rough code sketch of the first test follows below):
- We shifted the missing transverse energy (MET), a crucial feature, by 10% in the training set but not in the test set. The impact on the efficiency ε_WW (i.e., the true-positive rate) was only about 5%, which is small given such an extreme shift; the actual MET calibration is good to 2% or better.
- We suppressed all information relating to MET for a randomly selected 10% of the events. For a normal analysis this means an immediate 10% loss; with the RF, however, the loss is only 2.4%. The implication is that incomplete data are still useful, which could have major consequences in particle physics, and even more so in astronomy and astrophysics.
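
A rough sketch of how the first test could be set up, on hypothetical data (feature names, the cut value, and all numbers are illustrative assumptions; the talk's actual samples and results are not reproduced):

```python
# Rough sketch of the MET-shift robustness test (hypothetical data and numbers).
# Train with the MET-like feature shifted by 10%, test on unshifted events, and
# compare the signal efficiency at a fixed Y cut with the nominal training.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
features = ["met", "pt_ll", "m_ll", "n_jet"]
X = rng.normal(size=(4000, len(features)))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

X_train_shifted = X_train.copy()
X_train_shifted[:, 0] *= 1.10                      # 10% shift of MET in training only

y_cut = 0.5                                        # hypothetical working point

def signal_efficiency(model):
    score = model.predict_proba(X_test)[:, 1]
    return np.mean(score[y_test == 1] > y_cut)     # true-positive rate at the cut

nominal = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
shifted = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train_shifted, y_train)
print("nominal training: eff =", round(signal_efficiency(nominal), 3))
print("shifted training: eff =", round(signal_efficiency(shifted), 3))
```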

Slide 23: Thank You! J. Lovelace Rainbolt, Thoth Gunter, Michael Schmitt