A Toolkit for Statistical Data Analysis
Maria Grazia Pia, INFN Genova
S. Donadio, F. Fabozzi, L. Lista, S. Guatelli, B. Mascialino, A. Pfeiffer, M.G. Pia, A. Ribon, P. Viarengo
LCG Application Area Meeting, CERN, 5 May 2004

History and background

The motivation from Geant4
Validation of Geant4 physics models through comparison of simulation with experimental data or reference databases
[Figure: Fluorescence spectrum from Icelandic basalt (Mars-like rock), experimental data and simulation; ESA Bepi Colombo mission to Mercury, test beam at Bessy]
[Figure: Photon attenuation coefficient, Al; Geant4 Standard, Geant4 LowE, NIST; electromagnetic models in Geant4 w.r.t. NIST reference]

Historical introduction to EDF tests
In 1933 Kolmogorov published a short but landmark paper in the Italian Giornale dell’Istituto degli Attuari. He formally defined the empirical distribution function (EDF) and then enquired how close it would be to the true distribution F(x), when this is continuous.
It must be noted that Kolmogorov himself regarded his paper as the solution of an interesting probability problem, following the general interest of the time, rather than a paper on statistical methodology.
After Kolmogorov's article, over a period of about 10 years, a number of distinguished mathematicians (Smirnov, Cramer, Von Mises, Anderson, Darling, …) laid the foundations of methods for testing fit to a distribution based on the EDF.
The ideas in this paper have formed a platform for a vast literature, both on interesting and important probability problems and on methods of using the Kolmogorov statistics for testing fit to a distribution. The production of literature continues with great strength today, showing no sign of decreasing.

Typical use cases in HEP
Regression testing
– Throughout the software life-cycle
Online DAQ
– Monitoring detector behaviour w.r.t. a reference
Simulation validation
– Comparison with experimental data
Reconstruction
– Comparison of reconstructed vs. expected distributions
Physics analysis
– Comparisons of experimental distributions (ATLAS vs. CMS Higgs?)
– Comparison with theoretical distributions (data vs. Standard Model)

Software tools
Commercial products used by “professional” statisticians
– SPSS, NCSS...
In HEP: a lot of activity
– workshops/conferences (CERN, Durham, SLAC etc.)
– books (F. James et al., L. Lyons, R. Barlow etc.)
– sophisticated statistical algorithms applied in various data analyses
...but, in spite of the relevant role played by statistics in HEP, very limited availability of software tools for statistics in our field
– and in open-source software in general

Let’s do it ourselves...
A project to develop an open-source software system for statistical analysis
Provide tools for the statistical comparison of distributions
Create a hub to aggregate expertise and collaborative contributions from scientists interested in statistical methods
See presentation at LCG-AA meeting, 27 November 2002

Vision: the basics
Rigorous software process
Have a vision for the project
– General purpose tool for statistical analysis
– Toolkit approach (choice open to users)
– Open source product
Build on a solid architecture
Clearly define scope, objectives
Flexible, extensible, maintainable system
Software quality

Architectural guidelines
The project adopts a solid architectural approach
– to offer the functionality and the quality needed by the users
– to be maintainable over a large time scale
– to be extensible, to accommodate future evolutions of the requirements
Component-based architecture
– to facilitate re-use and integration in diverse frameworks
Dependencies
– adopt a standard (AIDA) for the user layer
– no dependence on any specific analysis tool
Python
– the “glue” for interactivity
The approach adopted is compatible with the recommendations of the LCG Architecture Blueprint Report

Software process
Unified Software Development Process, specifically tailored to the project
– practical guidance and tools from the RUP
– both rigorous and lightweight
– mapping onto ISO
– significant experience gained in the group from other projects
Incremental and iterative life-cycle model

The Goodness-of-Fit component

User Requirements
User requirements elicited, analysed and formally specified
– Functional (capability) and non-functional (constraint) requirements
– User Requirements Document available from the web site
Requirement traceability across: Requirements, Design, Implementation, Test & test results, Documentation

Simple user layer
Shields the user from the complexity of the underlying algorithms and design
The user only deals with AIDA objects and the choice of comparison algorithm

GoF algorithms
Algorithms for binned distributions
– Anderson-Darling test
– Chi-squared test
– Fisz-Cramer-von Mises test
– Tiku test (Cramer-von Mises test in chi-squared approximation)
Algorithms for unbinned distributions
– Anderson-Darling test
– Fisz-Cramer-von Mises test
– Goodman test (Kolmogorov-Smirnov test in chi-squared approximation)
– Kolmogorov-Smirnov test
– Kuiper test
– Tiku test (Cramer-von Mises test in chi-squared approximation)

Chi-squared test
Applies to binned distributions
It can also be useful for unbinned distributions, but the data must first be grouped into classes
Cannot be applied if the theoretical (expected) frequency in any class is < 5
– When a class falls below this threshold, one can try to merge contiguous classes until the minimum theoretical frequency is reached
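The class-merging rule above can be sketched in a few lines of Python. This is only an illustrative sketch, not the toolkit's implementation: the function names `merge_small_classes` and `chi2_statistic`, the left-to-right merging order and the sample frequencies are all assumptions made for the example.

```python
def merge_small_classes(observed, expected, min_expected=5.0):
    """Merge contiguous classes (left to right) until each merged class
    has an expected frequency of at least min_expected.
    Hypothetical helper, not part of the Statistical Toolkit."""
    merged_obs, merged_exp = [], []
    acc_obs = acc_exp = 0.0
    for o, e in zip(observed, expected):
        acc_obs += o
        acc_exp += e
        if acc_exp >= min_expected:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs = acc_exp = 0.0
    # A leftover tail that is still too small is folded into the last class.
    if acc_exp > 0.0:
        if merged_exp:
            merged_obs[-1] += acc_obs
            merged_exp[-1] += acc_exp
        else:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
    return merged_obs, merged_exp


def chi2_statistic(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up frequencies, for illustration only.
observed = [1, 3, 8, 12, 9, 4, 2, 1]
expected = [2.0, 4.0, 7.0, 11.0, 9.0, 4.0, 2.0, 1.0]

obs_m, exp_m = merge_small_classes(observed, expected)
chi2 = chi2_statistic(obs_m, exp_m)
print(exp_m, round(chi2, 4))
```

Merging reduces the number of classes, so the number of degrees of freedom used to convert the statistic into a p-value has to be reduced accordingly.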

More sophisticated algorithms
Unbinned distributions: supremum statistics
– Kolmogorov-Smirnov test
– Goodman approximation of KS test
– Kuiper test
[Figure: original distributions and their empirical distribution functions, showing the distance D_mn]
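As a rough illustration of the supremum statistics listed above, the following sketch computes the two-sample Kolmogorov-Smirnov distance D_mn, the Kuiper statistic and the Goodman chi-squared approximation from two unbinned samples. The function names are hypothetical, and the Goodman form used here (4 D² mn/(m+n), compared with a chi-squared distribution with 2 degrees of freedom) is the commonly quoted one; whether the toolkit uses exactly this form is an assumption.

```python
def ecdf_differences(x, y):
    """Return F_m(z) - G_n(z) evaluated at every distinct pooled sample point z."""
    xs, ys = sorted(x), sorted(y)
    m, n = len(xs), len(ys)
    diffs = []
    i = j = 0
    for z in sorted(set(xs + ys)):
        while i < m and xs[i] <= z:
            i += 1
        while j < n and ys[j] <= z:
            j += 1
        diffs.append(i / m - j / n)
    return diffs


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov distance: sup |F_m - G_n|."""
    return max(abs(d) for d in ecdf_differences(x, y))


def kuiper_statistic(x, y):
    """Kuiper statistic: sup (F_m - G_n) + sup (G_n - F_m)."""
    d = ecdf_differences(x, y)
    return max(max(d), 0.0) + max(-min(d), 0.0)


def goodman_chi2(x, y):
    """Goodman approximation: 4 D^2 m n / (m + n), to be compared with chi2(2 dof)."""
    m, n = len(x), len(y)
    return 4.0 * ks_statistic(x, y) ** 2 * m * n / (m + n)


# Made-up samples, for illustration only.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.5, 3.5, 4.5, 5.5]
print(ks_statistic(a, b), kuiper_statistic(a, b), goodman_chi2(a, b))
```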

More powerful algorithms
Tests containing a weighting function (unbinned and binned distributions)
– Cramer-von Mises test
– Anderson-Darling test
– Fisz-Cramer-von Mises test
– k-sample Anderson-Darling test
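For orientation, the standard one-sample form of this family of statistics is shown below. It is the textbook definition, not a formula taken from the slides: F_n is the empirical distribution function of n observations, F the reference distribution and ψ the weighting function. The two-sample and k-sample variants used for comparing data sets follow the same idea with F replaced by the pooled empirical distribution.

```latex
Q_n = n \int_{-\infty}^{\infty} \left[ F_n(x) - F(x) \right]^2 \, \psi\!\left(F(x)\right) \, dF(x),
\qquad
\psi(u) = 1 \;\Rightarrow\; W^2 \ \text{(Cramer-von Mises)},
\qquad
\psi(u) = \frac{1}{u\,(1-u)} \;\Rightarrow\; A^2 \ \text{(Anderson-Darling)}
```

The Anderson-Darling weight diverges as u approaches 0 or 1, which is why that test is more sensitive to discrepancies in the tails.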

Comparative documentation of tests

Test                     Power    Characteristics
Anderson-Darling         High     Sensitive to tails
χ²                       Low      General
Fisz-Cramer-von Mises    High     Symmetric, right-skewed distributions
Goodman                  Medium   Approximation of K-S to χ² test statistics
Kolmogorov-Smirnov       Medium   Derives from Kolmogorov statistics
Kuiper                   Medium   Sensitive to tails and median
Tiku                     High     Converts CvM statistics to a χ²

More about a comparative evaluation of tests in the User Documentation on our web site
Topic still subject to research activity in the domain of statistics

Power of tests
The power of a test is the probability of correctly rejecting the null hypothesis
χ² loses information in a test for unbinned distributions by grouping the data into cells
Kac, Kiefer and Wolfowitz (1955) showed that the Kolmogorov-Smirnov test requires n^(4/5) observations compared to n observations for χ² to attain the same power
Cramer-von Mises and Anderson-Darling statistics are expected to be superior to Kolmogorov-Smirnov’s, since they compare the two distributions all along the range of x, rather than looking for a marked difference at one point
In terms of power: χ² < supremum statistics tests < tests containing a weight function
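The definition of power given above can be made concrete with a small Monte Carlo estimate. The sketch below is an assumption-laden illustration, not a result from the presentation: it uses two Gaussian samples, the asymptotic two-sample Kolmogorov-Smirnov critical value at the 5% level (coefficient 1.358), and hypothetical function names. The rejection rate with no shift estimates the size of the test; the rejection rate with a shift estimates its power against that alternative.

```python
import math
import random


def ks_distance(x, y):
    """Two-sample Kolmogorov-Smirnov distance sup |F_m - G_n|."""
    xs, ys = sorted(x), sorted(y)
    m, n = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < m and j < n:
        z = min(xs[i], ys[j])
        while i < m and xs[i] <= z:
            i += 1
        while j < n and ys[j] <= z:
            j += 1
        d = max(d, abs(i / m - j / n))
    return d


def ks_rejects(x, y, c_alpha=1.358):
    """Reject H0 at about the 5% level using the asymptotic critical value."""
    m, n = len(x), len(y)
    return ks_distance(x, y) > c_alpha * math.sqrt((m + n) / (m * n))


def rejection_rate(shift, n=50, trials=300, seed=1):
    """Fraction of toy experiments in which H0 (same parent distribution) is rejected."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(shift, 1.0) for _ in range(n)]
        if ks_rejects(x, y):
            rejected += 1
    return rejected / trials


size = rejection_rate(0.0)    # H0 true: should stay near or below 0.05
power = rejection_rate(1.0)   # H0 false: fraction of correct rejections
print(size, power)
```

The same loop, run with a different test statistic in place of `ks_distance`, is one way to compare the power of two tests against a chosen alternative.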

Unit test: χ² (1)
Example from Piccolo book (Statistics, page 711)
The study concerns monthly birth and death distributions (binned data)
χ² test statistic = 15.8
Expected χ² = 15.8
Exact p-value=
Expected p-value=

Unit test: χ² (2)
Example from Cramer book (Mathematical Methods of Statistics, page 447)
The study concerns the sex distribution of children born in Sweden in 1935
χ² test statistic =
Expected χ² =
Exact p-value = 0
Expected p-value = 0

Unit test: K-S Goodman (1)
Example from Piccolo book (Statistics, page 711)
The study concerns monthly birth and death distributions (unbinned data)
χ² test statistic = 3.9
Expected χ² = 3.9
Exact p-value=
Expected p-value=
[Figure: cumulative function vs. months]

Unit test: K-S Goodman (2)
Example from Landenna book (Nonparametric Tests Based on Frequencies, page 287)
We consider body lengths of two independent groups of anopheles
χ² test statistic = 1.5
Expected χ² = 1.5
Exact p-value=
Expected p-value=
[Figure: body lengths]

Unit test: Kolmogorov-Smirnov (1)
Example from
The study concerns how long a bee stays near a particular tree (Redwell/Whitney)
D test statistic =
Expected D =
Exact p-value=
Expected p-value = 0.035
[Figure: cumulative distributions]

Unit test: Kolmogorov-Smirnov (2)
Example from Landenna book (Nonparametric Statistical Methods, page )
We consider one clinical parameter of two independent groups of patients
D test statistic = 0.65
Expected D = 0.65
Exact p-value=
Expected p-value=
[Figure: cumulative distributions]

Example of application results
[Figure: Fluorescence spectrum from Icelandic basalt (Mars-like rock), experimental data and simulation; ESA Bepi Colombo mission to Mercury, test beam at Bessy]
Anderson-Darling: A_c (95%) = 0.752
[Figure: Photon attenuation coefficient, Al; Geant4 Standard, Geant4 LowE, NIST]
Electromagnetic models in Geant4 w.r.t. NIST reference
χ² (N-L) = 13.1, ν = 20, p = 0.87
χ² (N-S) = 23.2, ν = 15, p = 0.08
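The two p-values quoted for the comparison with the NIST reference can be cross-checked from the χ² values, assuming the garbled symbol next to 20 and 15 on the slide is the number of degrees of freedom ν. The sketch below uses a standard-library-only survival function for the chi-squared distribution, built from the recurrence Q(x|ν+2) = Q(x|ν) + (x/2)^(ν/2) e^(-x/2) / Γ(ν/2 + 1); the function name `chi2_sf` is hypothetical and this is not the toolkit's code.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability P(chi2 >= x) for an integer number of degrees of freedom."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        q = math.exp(-half)              # closed form for 2 degrees of freedom
        k = 2
    else:
        q = math.erfc(math.sqrt(half))   # closed form for 1 degree of freedom
        k = 1
    # Step up two degrees of freedom at a time until dof is reached.
    while k < dof:
        q += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return q


p_lowe = chi2_sf(13.1, 20)       # chi2 and dof as quoted on the slide (N-L)
p_standard = chi2_sf(23.2, 15)   # chi2 and dof as quoted on the slide (N-S)
print(round(p_lowe, 2), round(p_standard, 2))
```

Under that reading of ν, both computed values agree with the slide to two decimal places.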

Latest release: 30 March 2004
GPL License

User Documentation
– Download
– Installation
– User Guide
– Statistics Reference Guide

A toolkit for modeling multi-parametric fit problems
F. Fabozzi, L. Lista, INFN Napoli
Initially developed while rewriting a Fortran fitter for a BaBar analysis
– Simultaneous estimate of: B(B± → J/ψ π±) / B(B± → J/ψ K±) and the direct CP asymmetry
– More control over the code was needed to explain a bias that appeared in the original fitter

Requirements
Provide tools for modeling parametric fit problems
Unbinned Maximum Likelihood (UML [*]) fit of:
– PDF parameters
– Yields of different sub-samples
– Both, mixed
χ² fits
Toy Monte Carlo to study the fit properties
– Fitted parameter distributions
– Pulls, bias, confidence level of fit results
New components included in the Statistical Toolkit
Architecture open to extension and evolution
[*] not Unified Modeling Language …
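To illustrate what a toy Monte Carlo study of fit properties looks like, here is a minimal sketch under stated assumptions: the PDF is a single exponential with lifetime τ, for which the unbinned maximum-likelihood estimate is the sample mean with error τ̂/√n. The function names and parameter values are hypothetical and unrelated to the fit component's actual interface. For a well-behaved fit, the pull distribution should have a mean close to 0 (no bias) and a width close to 1 (correct error estimate).

```python
import math
import random
import statistics


def fit_lifetime(sample):
    """Unbinned ML fit of f(t) = exp(-t/tau)/tau: tau_hat = mean, sigma = tau_hat/sqrt(n)."""
    n = len(sample)
    tau_hat = sum(sample) / n
    return tau_hat, tau_hat / math.sqrt(n)


def toy_pulls(tau_true=2.0, n_events=200, n_toys=500, seed=7):
    """Generate toy experiments, fit each one, and collect the pulls."""
    rng = random.Random(seed)
    pulls = []
    for _ in range(n_toys):
        sample = [rng.expovariate(1.0 / tau_true) for _ in range(n_events)]
        tau_hat, sigma = fit_lifetime(sample)
        pulls.append((tau_hat - tau_true) / sigma)
    return pulls


pulls = toy_pulls()
pull_mean = statistics.mean(pulls)    # bias indicator, expected near 0
pull_width = statistics.stdev(pulls)  # error-estimate indicator, expected near 1
print(round(pull_mean, 2), round(pull_width, 2))
```

A pull mean significantly different from 0 is the kind of symptom that would reveal a bias in a fitter.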

For LCG users
The Statistical Toolkit is distributed with PI as an external product
– Currently the previous release - not the latest yet - is distributed
– Update foreseen
Integration in the Savannah system for problem reporting foreseen
Open to collaboration to facilitate the usage in the LCG community
– feedback, user requirements, suggestions are welcome, of course!
Please contact for further information about the Statistical Toolkit in PI

References
Conference Proceedings:
– PhyStat Conference, SLAC, 2003
– IEEE Nuclear Science Symposium, Portland, 2003
Papers:
– S. Donadio et al., A toolkit for statistical data comparison, to be published in IEEE Trans. Nucl. Sci. (August 2004)
More papers in preparation
References kept up-to-date on the web site

Will be moved to a new area out of Geant4-INFN web (automatic re-direction)

Acknowledgments
Work supported and partially funded by the European Space Agency (ESA) under Contract No. 16339/02/NL/FM
Geant4 beta testing
– P. Cirrone (INFN-LNS), S. Guatelli (INFN Genova), S. Parlati (INFN-LNGS)
Fred James (CERN) and Louis Lyons (Oxford)
– many useful suggestions, discussions, encouragement...

Conclusions
A project to develop an open source, general purpose software toolkit for statistical data analysis is in progress
– to provide a product of common interest to user communities
Rigorous software process
– to contribute to the quality of the product
Component-based architecture, OO methods + generic programming
– to ensure openness to evolution, maintainability, ease of use
GoF component
Component for modeling multi-parametric fit problems
Software released and results available
– toolkit in use for Geant4 physics validation
– incremental and iterative life-cycle