Barbara Mascialino, INFN Genova An update on the Goodness of Fit Statistical Toolkit B. Mascialino, A. Pfeiffer, M.G. Pia, A. Ribon, P. Viarengo th Geant4 Space Users’ Workshop
Goodness of Fit testing
Goodness-of-fit testing is the mathematical foundation for the comparison of data distributions
One-sample problem: a sample is compared with a theoretical distribution
Two-sample problem: sample 1 is compared with sample 2
Use cases in experimental physics
Regression testing
–Throughout the software life-cycle
Online DAQ
–Monitoring detector behaviour w.r.t. a reference
Simulation validation
–Comparison with experimental data
Reconstruction
–Comparison of reconstructed vs. expected distributions
Physics analysis
–Comparison with theoretical distributions
–Comparisons of experimental distributions
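The one-sample and two-sample problems can be illustrated with the Kolmogorov-Smirnov distance, the largest gap between two cumulative distributions. The following is a minimal pure-Python sketch, not the toolkit's implementation; the function names `edf`, `ks_one_sample` and `ks_two_sample` are chosen here for illustration only.

```python
from bisect import bisect_right


def edf(sample):
    """Return the empirical distribution function (EDF) of a sample."""
    ordered = sorted(sample)
    n = len(ordered)
    # Fraction of sample values less than or equal to x.
    return lambda x: bisect_right(ordered, x) / n


def ks_one_sample(sample, cdf):
    """One-sample problem: distance between a sample EDF and a theoretical CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    distance = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        # The EDF jumps at x, so check the gap just before and at the jump.
        distance = max(distance, abs((i + 1) / n - f), abs(f - i / n))
    return distance


def ks_two_sample(sample1, sample2):
    """Two-sample problem: distance between the EDFs of two samples."""
    f1, f2 = edf(sample1), edf(sample2)
    # Both EDFs are step functions, so the maximum gap occurs at a data point.
    return max(abs(f1(x) - f2(x)) for x in list(sample1) + list(sample2))
```

For example, `ks_one_sample(data, lambda x: x)` compares `data` with a uniform distribution on [0, 1], while `ks_two_sample(data, reference)` needs no theoretical model at all.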
G.A.P. Cirrone, S. Donadio, S. Guatelli, A. Mantero, B. Mascialino, S. Parlati, M.G. Pia, A. Pfeiffer, A. Ribon, P. Viarengo
"A Goodness-of-Fit Statistical Toolkit"
IEEE Transactions on Nuclear Science (2004), 51 (5):
B. Mascialino, M.G. Pia, A. Pfeiffer, A. Ribon, P. Viarengo
"New developments of the Goodness-of-Fit Statistical Toolkit"
IEEE Transactions on Nuclear Science (2006), 53 (6), to be published
Software process guidelines
Adopt a process
–software quality
Unified Process, specifically tailored to the project
–practical guidance and tools from the RUP
–both rigorous and lightweight
–mapping onto ISO (and CMM)
Incremental and iterative life-cycle
1st cycle: 2-sample GoF tests
–1-sample GoF in preparation
Architectural guidelines
The project adopts a solid architectural approach
–to offer the functionality and the quality needed by the users
–to be maintainable over a long time scale
–to be extensible, to accommodate future evolutions of the requirements
Component-based architecture
–to facilitate re-use and integration in diverse frameworks
–layer architecture pattern
–core component for statistical computation
–independent components for interface to user analysis environments
Dependencies
–no dependence on any specific analysis tool
–can be used by any analysis tool, or together with any analysis tool
–offers a (HEP) standard (AIDA) for the user layer
The algorithms are specialised according to the kind of distribution (binned/unbinned)
GoF algorithms in the Statistical Toolkit
TWO-SAMPLE PROBLEM
Unbinned distributions
–Anderson-Darling test
–Anderson-Darling approximated test
–Cramer-von Mises test
–Generalised Girone test
–Goodman test (Kolmogorov-Smirnov test in chi-squared approximation)
–Kolmogorov-Smirnov test
–Kuiper test
–Tiku test (Cramer-von Mises test in chi-squared approximation)
–Weighted Kolmogorov-Smirnov test (2 flavours)
–Weighted Cramer-von Mises test
Binned distributions
–Anderson-Darling test
–Anderson-Darling approximated test
–Chi-squared test
–Fisz-Cramer-von Mises test
–Tiku test (Cramer-von Mises test in chi-squared approximation)
It is the most complete software for the comparison of two distributions, even among commercial/professional statistics tools.
It provides all 2-sample (edf) GoF algorithms existing in the statistics literature
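Several of the unbinned tests listed above are built from the same ingredient, the difference between the two empirical distribution functions (EDFs). The sketch below computes three of these statistics from that difference, using the textbook definitions (supremum distance for Kolmogorov-Smirnov, sum of the largest positive and negative deviations for Kuiper, and the quadratic two-sample form for Cramer-von Mises). It is an illustration in pure Python with hypothetical function names, it does not compute p-values, and it is not taken from the toolkit's code.

```python
from bisect import bisect_right


def _edf_values(sample, points):
    """Evaluate the EDF of `sample` at each of the given points."""
    ordered = sorted(sample)
    n = len(ordered)
    return [bisect_right(ordered, x) / n for x in points]


def edf_statistics(sample1, sample2):
    """Two-sample EDF statistics computed on the pooled sample."""
    pooled = sorted(list(sample1) + list(sample2))
    f1 = _edf_values(sample1, pooled)
    f2 = _edf_values(sample2, pooled)
    diffs = [a - b for a, b in zip(f1, f2)]
    n, m = len(sample1), len(sample2)
    d_plus = max(0.0, max(diffs))    # largest excess of EDF 1 over EDF 2
    d_minus = max(0.0, -min(diffs))  # largest excess of EDF 2 over EDF 1
    return {
        "kolmogorov_smirnov": max(d_plus, d_minus),
        "kuiper": d_plus + d_minus,
        # Quadratic statistic: squared EDF differences summed over the pooled sample.
        "cramer_von_mises": n * m / (n + m) ** 2 * sum(d * d for d in diffs),
    }
```

Comparing `[1, 4]` with `[2, 3]` shows how the statistics differ: the Kolmogorov-Smirnov distance is 0.5, while the Kuiper statistic is 1.0 because it adds the deviations in both directions.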
Simple user layer
Shields the user from the complexity of the underlying algorithms and design
The user only deals with their own analysis objects and the choice of comparison algorithm
First release: user layer for AIDA analysis objects
–LCG Architecture Blueprint, Geant4 requirement
Second release: added user layer for ROOT analysis objects
–in response to user requirements
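The separation between a core statistical component and a thin user layer can be sketched as follows. Every name in this example (`UnbinnedData`, `ALGORITHMS`, `from_sequence`, `compare`) is hypothetical and invented for illustration; none of it reproduces the toolkit's actual interface, and a real user layer would adapt AIDA or ROOT objects rather than plain sequences.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class UnbinnedData:
    """Core-layer data: plain values, with no analysis-tool types."""
    values: Tuple[float, ...]


def ks_distance(a: UnbinnedData, b: UnbinnedData) -> float:
    """Core-layer algorithm: Kolmogorov-Smirnov distance between two EDFs."""
    xs, ys = sorted(a.values), sorted(b.values)
    return max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in xs + ys
    )


# Core layer: registry of comparison algorithms (only one is shown here).
ALGORITHMS: Dict[str, Callable[[UnbinnedData, UnbinnedData], float]] = {
    "kolmogorov-smirnov": ks_distance,
}


def from_sequence(obj) -> UnbinnedData:
    """User layer: adapter from one kind of analysis object to core data."""
    return UnbinnedData(tuple(float(v) for v in obj))


def compare(obj1, obj2, algorithm: str, adapter=from_sequence) -> float:
    """User layer entry point: pick the objects and the algorithm, nothing else."""
    return ALGORITHMS[algorithm](adapter(obj1), adapter(obj2))
```

In this layout, supporting a new analysis environment means writing one more adapter; the core algorithms stay untouched.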
Which test to use?
Do we really need such a wide collection of GoF tests? Why?
Which is the most appropriate test to compare two distributions?
How "good" is a test at recognizing real equivalent distributions and rejecting fake ones?
The choice of the most suitable GoF test can be performed on the basis of two different criteria:
–Computational performance
–Statistical performance (power)
A) Performance of the GoF tests
Average CPU time

Test                                   Binned distributions   Unbinned distributions
Anderson-Darling                       (0.69±0.01) ms         (16.9±0.2) ms
Anderson-Darling (approximated)        (0.60±0.01) ms         (16.1±0.2) ms
Chi-squared                            (0.55±0.01) ms         -
Cramer-von Mises                       (0.44±0.01) ms         (16.3±0.2) ms
Generalised Girone                     -                      (15.9±0.2) ms
Goodman                                -                      (11.9±0.1) ms
Kolmogorov-Smirnov                     -                      (8.9±0.1) ms
Kuiper                                 -                      (12.1±0.1) ms
Tiku                                   (0.69±0.01) ms         (16.7±0.2) ms
Watson                                 -                      (14.2±0.1) ms
Weighted Kolmogorov-Smirnov (AD)       -                      (14.0±0.1) ms
Weighted Kolmogorov-Smirnov (Buning)   -                      (14.0±0.1) ms
Weighted Cramer-von Mises              -                      (14.0±0.1) ms
B) Power of GoF tests
The power of a test is the probability of correctly rejecting the null hypothesis, i.e. of rejecting it when it is false
Systematic study of all existing GoF tests in progress
–made possible by the extensive collection of tests in the Statistical Toolkit
–power of the GoF tests evaluated in a variety of alternative situations
No clear winner: the statistical performance of a test depends on the features of the distributions to be compared (skewness and tailweight) and on the sample size
Practical recommendations (general recipe)
1) first classify the type of the distributions in terms of skewness and tailweight
2) choose the most appropriate test for that type of distribution, evaluating the best test by means of the quantitative model proposed
Topic still subject to research activity in the domain of statistics
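The definition of power given on this slide can be estimated by Monte Carlo: draw many pairs of samples from two distributions known to differ, and count how often the test rejects the null hypothesis. The sketch below does this for the Kolmogorov-Smirnov test only, using the standard large-sample critical value. It is an assumption-laden illustration (equal sample sizes, asymptotic threshold, hypothetical function names), not the procedure or the quantitative model used in the study.

```python
import math
import random
from bisect import bisect_right


def ks_distance(a, b):
    """Kolmogorov-Smirnov distance between the EDFs of two samples."""
    xs, ys = sorted(a), sorted(b)
    return max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in xs + ys
    )


def ks_rejects(a, b, alpha=0.05):
    """Reject equality at level alpha using the asymptotic critical value."""
    c = math.sqrt(-0.5 * math.log(alpha / 2.0))  # about 1.358 for alpha = 0.05
    n, m = len(a), len(b)
    return ks_distance(a, b) > c * math.sqrt((n + m) / (n * m))


def estimate_power(draw_first, draw_second, n, trials=500, alpha=0.05, seed=1):
    """Fraction of trials in which the test rejects the null hypothesis.

    draw_first and draw_second take a random.Random and return one value.
    If both draw from the same distribution, the result estimates the
    false-rejection rate rather than the power.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a = [draw_first(rng) for _ in range(n)]
        b = [draw_second(rng) for _ in range(n)]
        if ks_rejects(a, b, alpha):
            rejections += 1
    return rejections / trials
```

For instance, `estimate_power(lambda r: r.gauss(0, 1), lambda r: r.gauss(1, 1), n=100)` estimates the power against a unit shift of a Gaussian. Repeating the estimate for alternatives that differ in skewness or tailweight, and for different sample sizes, is the kind of comparison the slide describes.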
Examples of practical applications
Statistical Toolkit usage
Geant4 physics validation
–rigorous approach: quantitative evaluation of Geant4 physics models with respect to established reference data
–see for instance: K. Amako et al., "Comparison of Geant4 electromagnetic physics models against the NIST reference data", IEEE Trans. Nucl. Sci. (2005)
LCG Simulation Validation project
–see for instance: A. Ribon, "Testing Geant4 with a simplified calorimeter setup"
CMS
–validation of "new" histograms w.r.t. "reference" ones in OSCAR Validation Suite
Usage also in space science, medicine, statistics, etc.
Validation of Geant4 e.m. physics models vs. NIST reference data
Electron Stopping Power
Physics models under test:
–Geant4 Standard
–Geant4 Low Energy – Livermore
–Geant4 Low Energy – Penelope
Reference data: NIST ESTAR - ICRU 37
χ² test (to include data uncertainties in the computation of the test statistic value)
Figure: experimental set-up
Figure: p-value vs. Z for Geant4 LowE Penelope, Geant4 Standard and Geant4 LowE EEDL, with the H0 rejection area marked (p-value stability study; legend also lists NIST - XCOM)
The three Geant4 models are equivalent
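One common way to let data uncertainties enter a χ² statistic is to divide each squared difference by the sum of the squared uncertainties of the two values being compared. The sketch below assumes this form and assumes independent, point-by-point uncertainties; the slide does not spell out the exact formula used in the validation, so the function and its arguments are illustrative only.

```python
def chi2_with_uncertainties(simulated, reference, sigma_simulated, sigma_reference):
    """Chi-squared statistic with both sets of uncertainties in the denominator.

    Returns (chi2, ndf), where ndf is the number of compared points,
    assuming no parameters were fitted to the data.
    """
    if not (len(simulated) == len(reference)
            == len(sigma_simulated) == len(sigma_reference)):
        raise ValueError("all inputs must have the same length")
    chi2 = 0.0
    for s, r, es, er in zip(simulated, reference, sigma_simulated, sigma_reference):
        variance = es * es + er * er  # independent uncertainties add in quadrature
        if variance <= 0.0:
            raise ValueError("each point needs a non-zero uncertainty")
        chi2 += (s - r) ** 2 / variance
    return chi2, len(simulated)
```

The p-value then follows from the χ² distribution with `ndf` degrees of freedom; a large p-value means the simulated values are compatible with the reference data within the stated uncertainties.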
Validation of Geant4 Atomic Relaxation vs NIST reference data
Fluorescence
Figure: Geant4 compared with NIST (○)
Table columns: Shell-start, Shell-end, Kolmogorov-Smirnov D, p-value
Validation of Geant4 electromagnetic and hadronic models against proton data
Physics models:
–Low Energy EM – ICRU49: p, ions
–Low Energy EM – Livermore: γ, e-
–Standard EM: e+
–HadronElastic with BertiniElastic
–Bertini Inelastic
Tests:
–CvM: Cramer-von Mises test
–KS: Kolmogorov-Smirnov test
–AD: Anderson-Darling test
Table: p-value of each test (CvM, KS, AD) for the left branch, the right branch and the whole curve
Figure: Geant4 (0.5 M events; LowE EM – ICRU49, BertiniElastic, Bertini Inelastic) compared with experimental data, as a function of depth (mm)
X-ray fluorescence spectrum in Iceland basalt (E_IN = 6.5 keV)
Bepi-Colombo mission, test beam at Bessy
Figure: counts vs. energy (keV)
Very complex distributions
χ² not appropriate (< 5 entries in some bins, physical information would be lost if rebinned)
Anderson-Darling: A_c (95%) = 0.752
Experimental measurements are comparable with Geant4 simulations
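The criterion quoted on this slide, fewer than 5 entries in some bins, is simple to check before choosing a test for binned data. The helper names below are hypothetical and the sketch only flags sparse bins; it does not decide which alternative test to apply.

```python
def sparse_bins(counts, min_entries=5):
    """Indices of the bins holding fewer than min_entries entries."""
    return [i for i, c in enumerate(counts) if c < min_entries]


def chi2_bins_adequate(counts, min_entries=5):
    """True only if every bin holds at least min_entries entries."""
    return not sparse_bins(counts, min_entries)
```

For a spectrum with counts `[12, 7, 3, 0, 9]`, `sparse_bins` returns `[2, 3]`, so a χ² comparison would require merging those bins. When merging would wash out physical structure such as narrow fluorescence lines, a test that does not rely on bin occupancy, like the Anderson-Darling test used here, is the alternative.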
Inflatable habitat vs a conventional rigid habitat
Comparison of alternative vehicle concepts in human missions to Mars
Reference: rigid structures as in the ISS (2 - 4 cm Al)
Figure: average energy deposit of GCR p; average energy deposit (MeV) vs. depth in the phantom (cm), for 4 cm Al and 10 cm water with the EM, Bertini and Binary physics configurations
Kolmogorov-Smirnov test
–Multi-layer + 10 cm water equivalent to 4 cm Al
–Multi-layer + 5 cm water equivalent to 2.15 cm Al

Energy deposited in phantom (MeV)
Shielding material   EM         Bertini   Binary
ML + 5 cm water      73.5 ± …   … ± …     … ± 0.4
ML + 10 cm water     71.9 ± …   … ± …     … ± …
… cm Al              72.9 ± …   … ± …     … ± …
… cm Al              73.9 ± …   … ± …     … ± 0.5

An inflatable habitat exhibits a shielding capability equivalent to a conventional rigid one
Conclusions
A novel, complete software toolkit for statistical analysis is being developed
–all the two-sample GoF tests available in the statistical domain + chi-squared test
–rigorous architectural design
–rigorous software process
It is the most complete software for the comparison of two distributions, even among commercial/professional statistics tools.
A systematic study of the power of GoF tests is in progress
–unexplored area of research
Application in various domains
–Geant4, HEP, space science, medicine…
Feedback and suggestions are very much appreciated