An update on the Goodness of Fit Statistical Toolkit

An update on the Goodness of Fit Statistical Toolkit
B. Mascialino, A. Pfeiffer, M.G. Pia, A. Ribon, P. Viarengo 4th Geant4 Space Users’ Workshop

Goodness of Fit testing
Goodness-of-fit testing is the mathematical foundation for the comparison of data distributions Regression testing Throughout the software life-cycle Online DAQ Monitoring detector behaviour w.r.t. a reference Simulation validation Comparison with experimental data Reconstruction Comparison of reconstructed vs. expected distributions Physics analysis Comparison with theoretical distributions Comparisons of experimental distributions Use cases in experimental physics THEORETICAL DISTRIBUTION SAMPLE ONE-SAMPLE PROBLEM SAMPLE 2 SAMPLE 1 TWO-SAMPLE PROBLEM

“A Goodness-of-Fit Statistical Toolkit”
G.A.P Cirrone, S. Donadio, S. Guatelli, A. Mantero, B. Mascialino, S. Parlati, M.G. Pia, A. Pfeiffer, A. Ribon, P. Viarengo “A Goodness-of-Fit Statistical Toolkit” IEEE- Transactions on Nuclear Science (2004), 51 (5): B. Mascialino, M.G. Pia, A. Pfeiffer, A. Ribon, P. Viarengo “New developments of the Goodness-of-Fit Statistical Toolkit” IEEE- Transactions on Nuclear Science (2006), 53 (6), to be published

Software process guidelines
Adopt a process software quality Unified Process, specifically tailored to the project practical guidance and tools from the RUP both rigorous and lightweight mapping onto ISO (and CMM) Incremental and iterative life-cycle 1st cycle: 2-sample GoF tests 1-sample GoF in preparation

Architectural guidelines
The project adopts a solid architectural approach to offer the functionality and the quality needed by the users to be maintainable over a large time scale to be extensible, to accommodate future evolutions of the requirements Component-based architecture to facilitate re-use and integration in diverse frameworks layer architecture pattern core component for statistical computation independent components for interface to user analysis environments Dependencies no dependence on any specific analysis tool can be used by any analysis tools, or together with any analysis tools offer a (HEP) standard (AIDA) for the user layer

The algorithms are specialised on the kind of distribution
(binned/unbinned)

GoF algorithms in the Statistical Toolkit
TWO-SAMPLE PROBLEM GoF algorithms in the Statistical Toolkit Binned distributions Anderson-Darling test Anderson-Darling approximated test Chi-squared test Fisz-Cramer-von Mises test Tiku test (Cramer-von Mises test in chi-squared approximation) for the comparison of two distributions, even among It is the most complete software commercial/professional statistics tools. It provides all 2-sample (edf) GoF algorithms existing in statistics literature Unbinned distributions Anderson-Darling test Anderson-Darling approximated test Cramer-von Mises test Generalised Girone test Goodman test (Kolmogorov-Smirnov test in chi-squared approximation) Kolmogorov-Smirnov test Kuiper test Tiku test (Cramer-von Mises test in chi-squared approximation) Weighted Kolmogorov-Smirnov test (2 flavours) Weighted Cramer-von Mises test

User Layer Simple user layer
Shields the user from the complexity of the underlying algorithms and design Only deal with the user’s analysis objects and choice of comparison algorithm First release: user layer for AIDA analysis objects LCG Architecture Blueprint, Geant4 requirement Second release: added user layer for ROOT analysis objects in response to user requirements

Which test to use? Which test to use?
Do we really need such a wide collection of GoF tests? Why? Which is the most appropriate test to compare two distributions? How “good” is a test at recognizing real equivalent distributions and rejecting fake ones? The choice of the most suitable GoF test can be performed on the basis of two different criteria: Computational performance Statistical performance (power) Which test to use?

Unbinned Distributions
A) Performance of the GoF tests AVERAGE CPU TIME Binned Distributions Unbinned Distributions Anderson-Darling (0.69±0.01) ms (16.9±0.2) ms Anderson-Darling (approximated) (0.60±0.01) ms (16.1±0.2) ms Chi-squared (0.55±0.01) ms Cramer-von Mises (0.44±0.01) ms (16.3±0.2) ms Generalised Girone (15.9±0.2) ms Goodman (11.9±0.1) ms Kolmogorov-Smirnov (8.9±0.1) ms Kuiper (12.1±0.1) ms Tiku (16.7±0.2) ms Watson (14.2±0.1) ms Weighted Kolmogorov-Smirnov (AD) (14.0±0.1) ms Weighted Kolmogorov-Smirnov (Buning) Weighted Cramer-von Mises

B) Power of GoF tests General recipe
The power of a test is the probability of rejecting the null hypothesis correctly Systematic study of all existing GoF tests in progress made possible by the extensive collection of tests in the Statistical Toolkit GoF tests power evaluated in a variety of alternative situations considered No clear winner: the statistical performance of a test depends on the features of the distributions to be compared (skewness and tailweight) and on the sample size Practical recommendations first classify the type of the distributions in terms of skewness and tailweight choose the most appropriate test given the type of distributions evaluating the best test by means of the quantitative model proposed Topic still subject to research activity in the domain of statistics General recipe p<0.0001

Examples of practical applications

Statistical Toolkit Usage
Geant4 physics validation rigorous approach: quantitative evaluation of Geant4 physics models with respect to established reference data see for instance: K. Amako et al., Comparison of Geant4 electromagnetic physics models against the NIST reference data IEEE Trans. Nucl. Sci (2005) LCG Simulation Validation project see for instance: A. Ribon, Testing Geant4 with a simplified calorimeter setup, CMS validation of “new” histograms w.r.t. “reference” ones in OSCAR Validation Suite Usage also in space science, medicine, statistics, etc.

Validation of Geant4 e.m. physics models vs. NIST reference data
Physics models under test: Geant4 Standard Geant4 Low Energy – Livermore Geant4 Low Energy – Penelope Reference data: NIST ESTAR - ICRU 37 Experimental set-up Electron Stopping Power centre p-value stability study Geant4 LowE Penelope Geant4 Standard Geant4 LowE EEDL NIST - XCOM c2 test (to include data uncertainties in the computation of the test statistics value) p-value Geant4 LowE Penelope Geant4 Standard Geant4 LowE EEDL The three Geant4 models are equivalent H0 REJECTION AREA Z

Validation of Geant4 Atomic Relaxation vs NIST reference data
Shell-end Kolmogorov-Smirnov D p-value 10 0.0192 1 11 0.0175 13 0.0250 14 0.0256 18 0.0294 19 0.0312 21 0.1429 22 0.0588 Fluorescence - Shell-start 3  Geant4 ○ NIST

Validation of Geant4 electromagnetic and hadronic models against proton data
Low Energy EM – ICRU49: p, ions Low Energy EM – Livermore: g, e- Standard EM : e+ HadronElastic with BertiniElastic Bertini Inelastic LowE EM – ICRU49 BertiniElastic Bertini Inelastic 0.5 M events p-value CvM KS AD Left branch 0.977 Right branch 0.985 Whole curve 0.994 CvM Cramer-von Mises test KS Kolmogorov-Smirnov test AD Anderson-Darling test Geant4 Experimental data mm

X-ray fluorescence spectrum in Iceand basalt
Test beam at Bessy Bepi-Colombo mission Energy (keV) Counts X-ray fluorescence spectrum in Iceand basalt (EIN=6.5 keV) Very complex distributions c2 not appropriate (< 5 entries in some bins, physical information would be lost if rebinned) Experimental measurements are comparable with Geant4 simulations Anderson-Darling Ac (95%) =0.752

Comparison of alternative vehicle concepts in human missions to Mars
Average energy deposit (MeV) Depth in the phantom (cm) Average energy deposit of GCR p GCR p 4 cm Al - Binary set 4 cm Al - Bertini 10 cm water - Binary 10 cm water – EM 4 cm Al – EM 10 cm water - Bertini Reference: rigid structures as in the ISS (2 - 4 cm Al) Kolmogorov-Smirnov test Multi-layer + 10 cm water equivalent to 4 cm Al Multi-layer + 5 cm water equivalent to 2.15 cm Al Shielding material Energy deposited in phantom (MeV) EM Bertini Binary ML + 5 cm water 73.5 ± 0.3 130.2 ± 0.5 119.3 ± 0.4 ML + 10 cm water 71.9 ± 0.3 128.0 ± 0.5 117.3 ± 0.5 4 cm Al 72.9 ± 0.3 127.5 ± 0.5 117.0 ± 0.4 2.15 cm Al 73.9 ± 0.3 130.5 ± 0.5 119.3 ± 0.5 Inflatable habitat vs a conventional rigid habitat An inflatable habitat exhibits a shielding capability equivalent to a conventional rigid one

Conclusions A novel, complete software software toolkit for statistical analysis is being developed all the two-sample GoF tests available in statistical domain + chi-squared test rigorous architectural design rigorous software process It is the most complete software for the comparison of two distributions, even among commercial/professional statistics tools. A systematic study of the power of GoF tests is in progress unexplored area of research Application in various domains Geant4, HEP, space science, medicine… Feedback and suggestions are very much appreciated

An update on the Goodness of Fit Statistical Toolkit

Similar presentations

Presentation on theme: "An update on the Goodness of Fit Statistical Toolkit"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

An update on the Goodness of Fit Statistical Toolkit

Similar presentations

Presentation on theme: "An update on the Goodness of Fit Statistical Toolkit"— Presentation transcript:

Similar presentations

About project

Feedback