Statistical Tools In Dzero. PHYSTAT Workshop 2005. Harrison B. Prosper. Statistical Software In DØ: The Good, the Bad and the Non-Existent.


Statistical Software In DØ
The Good, the Bad and the Non-Existent
Harrison B. Prosper
Florida State University
PHYSTAT Workshop, August 2005

Outline
- Analysis Example
- Available Software
- Wish List
- Summary

Example – DØ Single Top Group – I

Example – DØ Single Top Group – II
- Search for p + pbar → t + (q) + b + X
- 8 signal channels
- 7 background sources per signal channel
  - QCD, ttbar(lj), ttbar(ll), Wjj, Wbb, WW, WZ
- Each data bin is the sum of
  - tb, tqb, QCD, ttbar(lj), ttbar(ll), Wjj, Wbb, WW, WZ

Example – DØ Single Top Group – III
- Basic Statistical Quantity: Binned Likelihood
- Goal
  - To measure σ_s and σ_t
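
As a rough illustration of what a binned likelihood looks like in code, the sketch below assumes independent Poisson counts in each bin and a mean count built from two signal templates (tb and tqb, each scaled by a cross section) plus a summed background. The function names, the per-unit-cross-section templates and the toy numbers are all hypothetical; nothing here is taken from the DØ analysis code.

```python
import math

def binned_log_likelihood(observed, expected):
    """Sum of Poisson log-probabilities over bins.

    observed: integer counts D_i
    expected: mean counts d_i (each must be > 0)
    """
    total = 0.0
    for D, d in zip(observed, expected):
        # log Poisson(D | d) = D*log(d) - d - log(D!)
        total += D * math.log(d) - d - math.lgamma(D + 1)
    return total

def expected_counts(sigma_s, sigma_t, tb_shape, tqb_shape, background):
    """Mean count per bin: two signal templates scaled by their
    cross sections, plus the summed background (an assumed model)."""
    return [sigma_s * s + sigma_t * t + b
            for s, t, b in zip(tb_shape, tqb_shape, background)]

# Toy inputs, invented for illustration only.
tb = [1.0, 2.0, 1.0]      # expected tb events per unit cross section
tqb = [2.0, 3.0, 1.0]     # expected tqb events per unit cross section
bkg = [10.0, 8.0, 5.0]    # summed background per bin
data = [14, 15, 7]        # observed counts

ll_signal = binned_log_likelihood(data, expected_counts(1.0, 1.0, tb, tqb, bkg))
ll_null = binned_log_likelihood(data, expected_counts(0.0, 0.0, tb, tqb, bkg))
```

With these toy numbers the signal-plus-background hypothesis has the larger log-likelihood, which is the kind of comparison a fit over σ_s and σ_t would exploit.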

Example – Statistical Problems – IV
- Background Modeling
  - Model/data comparisons in multiple dimensions to determine the region with the “best” match.
  - Background events are generally weighted, for example, by the probability that the event could contain a b-jet.
  - These “tag-rate functions” are the results of fits to 2–3 dimensional empirical densities.

Example – Statistical Problems – V
- Discriminant Variable Selection
  - From a list of potentially useful variables, select the “best” subset.
- Multivariate Analyses
  - Random Grid Search
  - Neural Networks
  - Decision Trees
  - Bayesian Neural Networks

Example – Statistical Problems – VI
- Posterior Density Computation
  - Must marginalize over hundreds of variables (acceptances and background yields) and must do so taking into account known dependencies.
- Analysis Validation
  - Ideally, the entire analysis is run repeatedly on fake data-sets to study its frequency behavior.

Available Software – I
- Fitting
  - Minuit applied to Root histograms
  - PoissonGammaFit (more later!)
- Multivariate Methods
  - RGSearch (a few incompatible versions)
  - Jetnet (v3.4) (with C++ binding)
  - MLPfit (several versions)
  - oo_neural (OOP version of BP)

Available Software – II
- Classifier (decision tree)
- C2.4 (decision tree)
- TerraFerma (misc. methods)
- BNN (Bayesian NN)
- Limit Setting
  - top_statistics (Bayes, CLs)
  - blimit (Bayes – more robust version of DØ web-calculator)

Available Software – III
- Adaptive Numerical Integration
  - AdBayes (C++ binding of Alan Genz’s Fortran code)
- Python Bindings
  - RGSearch, Jetnet, AdBayes, PoissonGammaFit, CLHEP, Coin, Root, etc.

PoissonGammaFit
- Model
  - For each bin i we write the (mean) data count d_i as a linear sum of N (mean) source counts
  - Likelihood for observed distribution D = {D_i}
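
The formulas on this slide are not in the transcript. One plausible reading of the two statements, assuming source weights p_j and mean source counts a_ji (symbols chosen to match the p and A_ji named on the next slide) and independent Poisson counts per bin, would be:

```latex
d_i = \sum_{j=1}^{N} p_j \, a_{ji},
\qquad
L(D \mid p, a) = \prod_{i} \frac{d_i^{D_i} \, e^{-d_i}}{D_i!}
```

This is a standard form for a binned template fit and is offered only as a sketch; the original slide may have used different notation.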

PoissonGammaFit – II
- Bayesian Inference for Moments m_r of p
- Prior (given source counts A_ji)
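
To make "moments of p" concrete, here is a minimal sketch of estimating the posterior mean and variance of each source weight by drawing points from a flat prior and weighting them by a Poisson likelihood. It is deliberately simplified: the source counts A_ji are treated as exactly known, whereas the slide indicates a prior conditioned on them, and the prior range `p_max`, the function names and the sampling scheme are assumptions, not a description of how PoissonGammaFit works internally.

```python
import math
import random

def poisson_log_like(D, p, A):
    """log L for d_i = sum_j p[j] * A[j][i], dropping the constant log(D_i!) terms."""
    total = 0.0
    for i, Di in enumerate(D):
        d = sum(p[j] * A[j][i] for j in range(len(A)))
        if d <= 0.0:
            return float("-inf")
        total += Di * math.log(d) - d
    return total

def posterior_moments(D, A, p_max=3.0, total=10000, seed=1):
    """Posterior mean and variance of each p_j under a flat prior on [0, p_max],
    estimated with `total` sampling points drawn from that prior."""
    rng = random.Random(seed)
    n = len(A)
    samples, logw = [], []
    for _ in range(total):
        p = [rng.uniform(0.0, p_max) for _ in range(n)]
        samples.append(p)
        logw.append(poisson_log_like(D, p, A))
    shift = max(logw)  # subtract the maximum to avoid underflow in exp
    w = [math.exp(lw - shift) for lw in logw]
    wsum = sum(w)
    mean = [sum(wk * pk[j] for wk, pk in zip(w, samples)) / wsum
            for j in range(n)]
    var = [sum(wk * (pk[j] - mean[j]) ** 2 for wk, pk in zip(w, samples)) / wsum
           for j in range(n)]
    return mean, var

# Toy case: one source with counts (10, 20) and observed data (10, 20).
mean, var = posterior_moments([10, 20], [[10.0, 20.0]])
```

For this toy case the posterior mean of p should come out close to 1, since the data match the single source template.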

What’s Available?
- C++ Class
  - PoissonGammaFit(vvdouble& A, vdouble& D, string prior="flat", bool scale=true, int total=10000)
- Methods
  - m = o.mean()
  - v = o.variance()
vdouble = vector<double>
vvdouble = vector<vector<double> >

What’s Available? – II
- Main Program
  Usage: pgammafit
    -h
    -f [hist-file-list (histfile.list)]
    -n [# of sampling points (10000)]
    -o [name of plot (pgammafit.gif)]
- Uses
  - HistogramCache, PoissonGammaFit, Minuit


Bayesian Neural Networks
[Diagram: network y(x, w) with inputs x_1, x_2 and weights w = (u, a, v, b)]
For binary (0,1) classification, y(x, w) → p(1|x)
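
The idea that a network output approximates p(1|x), and that a Bayesian network averages over many weight settings, can be sketched as follows. The architecture (one tanh hidden layer, logistic output) and the roles assigned to u, a, v and b are assumptions made for illustration; the slide only names the weights. The weight samples would normally come from an MCMC run and are hard-coded here.

```python
import math

def network(x, w):
    """Single-hidden-layer network y(x, w) with w = (u, a, v, b).

    u[j][i]: input-to-hidden weights, a[j]: hidden biases,
    v[j]: hidden-to-output weights, b: output bias (assumed layout).
    """
    u, a, v, b = w
    hidden = [math.tanh(a[j] + sum(u[j][i] * x[i] for i in range(len(x))))
              for j in range(len(a))]
    z = b + sum(v[j] * hidden[j] for j in range(len(v)))
    return 1.0 / (1.0 + math.exp(-z))  # logistic output, always in (0, 1)

def bnn_average(x, weight_samples):
    """Estimate p(1|x) as the mean of y(x, w_n) over the sampled weights w_n."""
    return sum(network(x, w) for w in weight_samples) / len(weight_samples)

# Two hand-made weight samples with one hidden node each (illustrative only).
w1 = ([[1.0, 0.0]], [0.0], [2.0], 0.0)
w2 = ([[0.0, 1.0]], [0.0], [2.0], 0.0)
p_origin = bnn_average([0.0, 0.0], [w1, w2])
p_other = bnn_average([1.0, -1.0], [w1, w2])
```

Averaging over weight samples, rather than picking a single best network, is what distinguishes this from an ordinary trained network.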

Bayesian Neural Networks – II
[Plot: x-axis HT_AllJets_MinusBestJets]
- Dots: p(1|H_T) = H_tqb / (H_tqb + H_Wbb), where H is a 1-D histogram
- Curves: individual NNs y(H_T, w_n)
- Black curve
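
The dots in the plot are a bin-by-bin ratio of two histograms. A minimal sketch of that calculation is shown below, assuming both histograms share the same binning and are normalized consistently; the bin contents are invented and the function name is hypothetical.

```python
def bin_ratio(h_signal, h_background):
    """Bin-by-bin p(1|bin) = S / (S + B); None where both bins are empty."""
    out = []
    for s, b in zip(h_signal, h_background):
        out.append(s / (s + b) if (s + b) > 0 else None)
    return out

# Toy bin contents for the tqb and Wbb histograms of H_T.
h_tqb = [2.0, 6.0, 9.0, 0.0]
h_wbb = [8.0, 6.0, 1.0, 0.0]
ratio = bin_ratio(h_tqb, h_wbb)  # [0.2, 0.5, 0.9, None]
```

Comparing such a histogram ratio with the network output in one dimension is a simple check that the network approximates the class probability.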


What’s Available?
- Radford Neal’s Package
  - C codes compiled and linked into a set of programs:
    net-spec      Specify network
    data-spec     Specify training data
    net-gen       Initialize network
    mc-spec       Specify MCMC parameters
    net-mc        Run MCMC
    net-display   Display network parameters
    netwrite.py   Write results to a C++ function

The Bad
- It’s a Jungle Out There!
  - Difficult to express ideas clearly
  - Tools typically cannot be moved, easily, from one framework to another
  - No clear protocol for interface between heterogeneous data formats
  - No algebra of histograms
  - Histograms tightly coupled to their viewers: Use Root or die!

The Bad – II
- Inadequate Support For:
  - Generating ensembles of observations, possibly with conditioning, to study bias, variance, coverage etc.
  - Assessing robustness with respect to likelihoods and prior densities
  - Studying different confidence limit procedures
  - Studying different optimization criteria

The Non-Existent
- DØ has no (or inadequate) tools to:
  - Browse data in truly interesting ways
  - Perform goodness-of-fit tests that go beyond KS and χ²
  - Construct Bayesian models, systematically
  - Perform sensitivity analyses, systematically
- No domain-specific language

Wish List – I
- Free At Last!
  - Statistical tool separate from, and independent of, the environment in which it might be used.
  - However, provide bindings for different environments/languages (R, Root, Ruby, Python, Java, etc.)
- Less Is More!
  - Each statistical tool should encapsulate a single coherent statistical idea.

Wish List – II
- Histograms
  - Histogram and histogram viewers should be independent of each other. (A sensible idea from Marc Paterno!)
  - Elegant algebra of histograms: h = a*h_1 + b*h_2/h_3 etc.
  - Powerful, intuitive tools for multi-dim. data exploration
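
A histogram algebra of the kind wished for here could look roughly like the sketch below: a histogram type that holds only bin contents, knows nothing about plotting, and supports expressions such as a*h1 + b*h2/h3 through operator overloading. The class `Hist` and its policy for empty denominator bins are hypothetical choices made for illustration, and error propagation is ignored.

```python
class Hist:
    """Minimal 1-D histogram: bin contents only, no viewer attached."""

    def __init__(self, contents):
        self.c = [float(x) for x in contents]

    def _binary(self, other, op):
        # Combine bin by bin with another Hist, or with a scalar.
        if isinstance(other, Hist):
            if len(other.c) != len(self.c):
                raise ValueError("histograms have different binning")
            return Hist([op(x, y) for x, y in zip(self.c, other.c)])
        return Hist([op(x, float(other)) for x in self.c])

    def __add__(self, other):
        return self._binary(other, lambda x, y: x + y)

    def __mul__(self, other):
        return self._binary(other, lambda x, y: x * y)

    __rmul__ = __mul__  # lets a scalar appear on the left: 2 * h

    def __truediv__(self, other):
        # Empty denominator bins give 0 here; a real tool would need a stated policy.
        return self._binary(other, lambda x, y: x / y if y != 0 else 0.0)

h1 = Hist([1, 2, 3])
h2 = Hist([4, 6, 8])
h3 = Hist([2, 3, 4])
h = 2 * h1 + 0.5 * h2 / h3  # bin contents become [3.0, 5.0, 7.0]
```

Because `Hist` carries no drawing code, any viewer could be written against its bin contents, which is the separation of histogram and viewer that the slide asks for.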

Wish List – III
- Likelihoods
  - Flexible method for reporting them; perhaps as swarms of points generated via MCMC?
- Frequency Methods
  - Flexible ensemble generator, with easily extracted sub-ensembles
  - Flexible query of ensembles (to get coverage, error rates, variances, bias etc.)
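
The ensemble generator and ensemble queries described above might be sketched as follows for the simplest possible case, a single Poisson counting experiment. The function names, the naive n ± sqrt(n) interval and the choice of true mean are all illustrative assumptions, not an existing DØ tool.

```python
import math
import random

def poisson(rng, mean):
    """Draw one Poisson variate (Knuth's method; adequate for small means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def generate_ensemble(mean, size, seed=1):
    """An ensemble of pseudo-experiments, each a single observed count."""
    rng = random.Random(seed)
    return [poisson(rng, mean) for _ in range(size)]

def coverage(ensemble, true_mean, interval):
    """Fraction of pseudo-experiments whose interval contains the true mean."""
    hits = 0
    for n in ensemble:
        lo, hi = interval(n)
        if lo <= true_mean <= hi:
            hits += 1
    return hits / len(ensemble)

def root_n_interval(n):
    """Naive n +/- sqrt(n) interval, used only as an example procedure."""
    return n - math.sqrt(n), n + math.sqrt(n)

ensemble = generate_ensemble(mean=5.0, size=20000)
cov = coverage(ensemble, 5.0, root_n_interval)
high = [n for n in ensemble if n >= 5]  # a sub-ensemble selected by a condition
```

For a true mean of 5 the naive interval covers in roughly 60% of pseudo-experiments, short of the nominal 68%, which is exactly the kind of behavior such a query is meant to expose.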

Wish List – IV
- Bayesian Methods
  - Flexible robustness studies (prior family, likelihood family etc.)
  - Multi-dimensional integration (adaptive and Markov Chain MC)
- Domain Specific Language
  - No dereferencing, auto_ptr, dynamic_cast, pointers, templates etc. please,… we’re British!

Summary
- The Good
  - Many statistical tools are in use at DØ
  - A lot more needed – opportunity for creativity!
- The Bad
  - Current tools are a reflection of non-interacting idiosyncratic minds!
- The Non-Existent
  - Lack of a domain-specific language for expression of statistical ideas. I don’t want to think about pointers and const-correctness when I’m trying to think about mathematics.