
Are we there yet?
Philippe Canal (FNAL) for the GeantV development team
G.Amadio (UNESP), Ananya (CERN), J.Apostolakis (CERN), A.Arora (CERN), M.Bandieramonte (CERN), A.Bhattacharyya (BARC), C.Bianchini (UNESP), R.Brun (CERN), Ph.Canal (FNAL), F.Carminati (CERN), L.Duhem (Intel), D.Elvira (FNAL), A.Gheata (CERN), M.Gheata (CERN), I.Goulas (CERN), R.Iope (UNESP), S.Y.Jun (FNAL), G.Lima (FNAL), A.Mohanty (BARC), T.Nikitina (CERN), M.Novak (CERN), W.Pokorski (CERN), A.Ribon (CERN), R.Sehgal (BARC), O.Shadura (CERN), S.Vallecorsa (CERN), S.Wenzel (CERN), Y.Zhang (CERN)

Yardstick: CMS With Tabulated Physics (Realistic-Scale Simulation)
- pp 14 TeV minimum-bias events produced by Pythia 8
- 2015 CMS detector
- 4 T uniform magnetic field (a decent approximation of the real solenoidal field)
- Low energy cut at 1 MeV
- 'Tabulated' physics: a library of sampled interactions and tabulated cross-sections (see the sketch below)
- The same test, described above, is run with both Geant4 and GeantV, with various versions of the geometry library
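As a rough illustration of the 'tabulated physics' idea, sampling interactions from pre-tabulated cross-sections rather than evaluating full physics models, the sketch below picks a process for a step by discrete sampling over cross-sections interpolated from an energy grid. All types and names here are hypothetical, not the actual tabulated-physics library.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical per-process cross-section table on a common energy grid.
// A sketch of the "tabulated physics" idea only, not GeantV code.
struct XsecTable {
  std::vector<double> energy;             // ascending energy grid [MeV]
  std::vector<std::vector<double>> xsec;  // xsec[process][energyBin]
};

// Linear interpolation of one process' cross-section at energy e.
double Interpolate(const XsecTable& t, std::size_t proc, double e) {
  const auto& grid = t.energy;
  if (e <= grid.front()) return t.xsec[proc].front();
  if (e >= grid.back())  return t.xsec[proc].back();
  std::size_t i = 1;
  while (grid[i] < e) ++i;
  double f = (e - grid[i - 1]) / (grid[i] - grid[i - 1]);
  return (1 - f) * t.xsec[proc][i - 1] + f * t.xsec[proc][i];
}

// Pick the interaction for this step in proportion to the tabulated cross-sections.
std::size_t SampleProcess(const XsecTable& t, double e, std::mt19937& rng) {
  std::vector<double> sigma(t.xsec.size());
  for (std::size_t p = 0; p < sigma.size(); ++p) sigma[p] = Interpolate(t, p, e);
  std::discrete_distribution<std::size_t> pick(sigma.begin(), sigma.end());
  return pick(rng);
}
```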

Geometry - VecGeom
- Geometry takes 30-40% of the CPU time of a typical Geant4 HEP simulation
- VecGeom is a library of vectorised geometry algorithms designed to take maximum advantage of SIMD architectures (see the sketch below)
- Substantial performance gains also in scalar mode, thanks to better scalar code
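As a flavour of what 'vectorised geometry algorithms' means in practice, the sketch below (hypothetical types and function names, not the VecGeom API) computes ray-sphere distances for a whole batch in a single element-wise loop that an auto-vectoriser can map onto SIMD lanes:

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical structure-of-arrays batch of rays: positions and unit directions.
// A sketch of element-wise SIMD-friendly geometry, not the VecGeom interface.
struct RayBatch {
  const double* px; const double* py; const double* pz;
  const double* dx; const double* dy; const double* dz;
  std::size_t n;
};

// Distance from each ray origin to a sphere of radius R centred at the origin
// (a large sentinel value when the ray misses). The loop body is branch-light
// and data-parallel, so the compiler can emit one SIMD iteration per lane.
void DistanceToSphere(const RayBatch& r, double R, double* dist) {
  const double kInfinity = 1e30;
  for (std::size_t i = 0; i < r.n; ++i) {
    double b = r.px[i] * r.dx[i] + r.py[i] * r.dy[i] + r.pz[i] * r.dz[i];
    double c = r.px[i] * r.px[i] + r.py[i] * r.py[i] + r.pz[i] * r.pz[i] - R * R;
    double disc = b * b - c;
    double d = -b - std::sqrt(std::fabs(disc));  // first intersection, if any
    dist[i] = (disc >= 0.0 && d >= 0.0) ? d : kInfinity;
  }
}
```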

Example: Polycone
- Speedup factor of 3.3x vs. Geant4 and 7.6x vs. ROOT for the most performance-critical methods
- It is today possible to run Geant4 simulations with USolids/VecGeom shapes replacing the Geant4 shapes (seamless to the user)
- The VecGeom/USolids library is shipped as an independent external library since release 10.2 of Geant4
- USolids source code repository: gitlab.cern.ch/VecGeom/VecGeom
- VecGeom/USolids is developed as part of the EU AIDA-2020 initiative

The X-Ray benchmark
- The X-Ray benchmark tests geometry navigation in a real detector geometry
- It scans a module with virtual rays on a grid corresponding to the pixels of the final image
- Each ray is propagated from boundary to boundary
- The pixel grey level is determined by the number of boundary crossings (see the sketch below)
- A simple geometry example (concentric tubes) emulating a tracker detector is used for the Xeon Phi benchmark
- It probes the vectorized geometry elements plus global navigation as a task
- OMP parallelism + 'basket' model
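A minimal sketch of the scan loop described above, with a hypothetical distToBoundary callable standing in for the real navigator (this is not the GeantV benchmark code):

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// distToBoundary(x, y, z) stands in for the navigator: it returns the distance to
// the next geometry boundary along +z, or a negative value once outside the world.
std::vector<int> XRayScan(std::size_t nx, std::size_t ny,
                          double xMin, double xMax, double yMin, double yMax,
                          double zStart,
                          const std::function<double(double, double, double)>& distToBoundary) {
  std::vector<int> image(nx * ny, 0);
  for (std::size_t iy = 0; iy < ny; ++iy) {
    for (std::size_t ix = 0; ix < nx; ++ix) {
      // One virtual ray per pixel, shot along +z through the module.
      double x = xMin + (xMax - xMin) * (ix + 0.5) / nx;
      double y = yMin + (yMax - yMin) * (iy + 0.5) / ny;
      double z = zStart;
      int crossings = 0;
      for (double d = distToBoundary(x, y, z); d >= 0.0; d = distToBoundary(x, y, z)) {
        z += d + 1e-9;  // jump onto the boundary plus a small push to cross it
        ++crossings;    // the pixel grey level is the number of crossings
      }
      image[iy * nx + ix] = crossings;
    }
  }
  return image;
}
```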

Vector performance
- Gaining up to 4.5x from vectorization in basketized mode, approaching the ideal vectorization case (when no regrouping of vectors is needed)
- Vector starvation starts when filling more thread slots than the core count, but the performance loss is not dramatic
- Better vectorization than on the Sandy Bridge host (expected)
- Measurement modes (see the sketch below):
  - Scalar case: simple loop over pixels
  - Ideal vectorization case: fill the vectors with N copies of the same X-ray
  - Realistic (basket) case: group baskets per geometry volume
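The three measurement modes above differ only in how the work fed to the vector kernel is grouped. A rough sketch of that distinction, with hypothetical Track and ProcessBasket types standing in for the GeantV scheduler and geometry kernel:

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical track type; a sketch of the three benchmark modes, not the GeantV scheduler.
struct Track { int volumeId; double x, y, z, dx, dy, dz; };

// Stand-in for the vectorised geometry kernel: in the real prototype this would
// run the SIMD distance/safety computation over all tracks in the basket.
void ProcessBasket(const std::vector<Track>& basket) { (void)basket; }

// (1) Scalar case: a simple loop, one track at a time (baskets of size one).
void RunScalar(const std::vector<Track>& tracks) {
  for (const Track& t : tracks) ProcessBasket({t});
}

// (2) Ideal vectorization case: every slot holds the same track, so there is no
// regrouping cost and perfect lane utilisation -- an upper bound on the speedup.
void RunIdeal(const Track& t, std::size_t vectorWidth) {
  ProcessBasket(std::vector<Track>(vectorWidth, t));
}

// (3) Realistic (basket) case: tracks are regrouped by the geometry volume they
// are currently in, then each homogeneous basket is dispatched to the kernel.
void RunBasketized(const std::vector<Track>& tracks) {
  std::unordered_map<int, std::vector<Track>> baskets;
  for (const Track& t : tracks) baskets[t.volumeId].push_back(t);
  for (const auto& kv : baskets) ProcessBasket(kv.second);
}
```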

Results on the Xeon Phi
- Ideal parallelism for the code 'hotspot'
- Vectorisation intensity in excess of 10 for the most computationally demanding portions of the code

I/O on HPC
Challenges:
- Individual outputs inherently have variable size (different numbers of particles, vertices, etc.), and the variability is furthered by compression
- Preparing data for output is CPU intensive
- In practice, parallel writes to the same file are not possible
- The usual solution, writing one file per process, does not scale to HPC
Solution being explored (see the sketch below):
- Extend the 'Buffer' mode to multiple nodes
- Set up a tree of nodes sending the prepared output upstream
- Send data in chunks during processing to maximize parallelism
- Only a few end-node processes or threads actually write to disk, with a cardinality close to the number of physical disks
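A minimal sketch of the 'few writers' idea described above, using only basic MPI point-to-point calls and a flat worker-to-writer layout rather than the full tree of aggregator nodes; the tags, chunk sizes, and single-writer choice are illustrative assumptions, not the GeantV I/O design:

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

// Illustrative only: rank 0 acts as the single writer; every other rank prepares
// variable-size output chunks and sends them upstream during processing.
int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int kDataTag = 1, kDoneTag = 2;

  if (rank == 0) {
    // Writer rank: receive chunks of any size from any worker and append to disk.
    std::FILE* out = std::fopen("output.bin", "wb");
    int workersLeft = size - 1;
    while (workersLeft > 0) {
      MPI_Status status;
      MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
      if (status.MPI_TAG == kDoneTag) {
        int dummy = 0;
        MPI_Recv(&dummy, 1, MPI_INT, status.MPI_SOURCE, kDoneTag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        --workersLeft;
      } else {
        int nbytes = 0;
        MPI_Get_count(&status, MPI_BYTE, &nbytes);
        std::vector<char> buffer(nbytes);
        MPI_Recv(buffer.data(), nbytes, MPI_BYTE, status.MPI_SOURCE, kDataTag,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::fwrite(buffer.data(), 1, buffer.size(), out);
      }
    }
    std::fclose(out);
  } else {
    // Worker rank: send a few variable-size chunks, then a "done" message so the
    // writer knows when to stop listening.
    for (int chunk = 0; chunk < 4; ++chunk) {
      std::vector<char> payload(1024 * (rank + chunk + 1), 'x');  // fake output
      MPI_Send(payload.data(), static_cast<int>(payload.size()), MPI_BYTE, 0,
               kDataTag, MPI_COMM_WORLD);
    }
    int done = 1;
    MPI_Send(&done, 1, MPI_INT, 0, kDoneTag, MPI_COMM_WORLD);
  }

  MPI_Finalize();
  return 0;
}
```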

Putting It All Together - CMS Yardstick

Scheduler        | Geometry                                       | Physics                                 | Magnetic Field Stepper
Geant4 only      | Legacy G4                                      | Various physics lists                   | Various RK implementations
Geant4 or GeantV | VecGeom 2016 scalar                            | Tabulated Physics, scalar physics code  | Helix (fixed field), Cash-Karp Runge-Kutta
GeantV only      | VecGeom 2015, VecGeom 2016 vector, legacy TGeo | Vector physics code                     | Vectorized RK implementation

Semantic changes

Putting It All Together - CMS Yardstick
- Some of the improvements can be back-ported to Geant4
- The overhead of basket handling is under control
- Ready to take advantage of vectorization throughout

Improvement factors (total) with respect to Geant4:
- Legacy (TGeo) geometry library: 1.5 (algorithmic improvements in the infrastructure)
- VecGeom (estimate): 2.4 (algorithmic improvements in the geometry)
- Early 2016 VecGeom: 3.3 (further geometric algorithmic improvements and some vectorization)

[Profiling comparison table; the numeric values are not preserved in the transcript. Columns: g4 4T RK, gv 4T RK, diff, ratio, g4 0T, gv 0T. Rows: total, init, infra, touchable, PropagateTracksScalar, PropagateTracks, TabPhys User, TabPhys (AlongStep and PostStep), G4Transportation::PostStep (* including LocatePoint), Touchable, DefinePhysicalLength, Physics, PropagateInField, rest, diffs.]

PropagateTracks vs. PropagateTracksScalar (percentages are of the whole run):
- With scalar navigation: 25.2% / 32.5%; 119 s / 153 s
- With vector navigation: 26.0% / 32.1%; 125 s / 155 s

For the vector navigator, GetSafety takes 15 s. The overhead inside ComputeTransportLength for the vector navigator is also noticeable (3.4 s).

Three navigators seem to be called in both cases, and the time spent in them is similar (slightly faster for the vector navigator):
- NewSimpleNav vector / scalar: 31.7 s / 34.9 s
- HybridNav vector / scalar: 29.4 s / 33.9 s
- SimpleABBox vector / scalar: 1.9 s / 2.2 s

This is 'because' most calls inside take the same time, for example GlobalLocator::LocateGlobalPoint and the Polycone and Polyhedron DistanceToIn and DistanceToOut. The timing for Tube and Box does change:
- Tube DistanceToOut vector / scalar: 1.08 s / 1.93 s (18.2 vs e6 calls)
- Box DistanceToOut vector / scalar: 0.36 s / 0.53 s
- Tube DistanceToIn vector / scalar: 0.39 s / 0.23 s (!) (5.8 vs e6 calls)

which obviously won't make a large difference overall. I do not know why the vector Tube DistanceToIn is seemingly slower. When looking at the assembly code (which VTune claims corresponds to Tube's DistanceToOut), it does look like the 'scalar' version is using vector instructions.

Vector navigation timing breakdown:
- Total: 598 s
- PropagateTracksScalar: 225 s
- PropagateTracks: 167 s
- GUFieldPropagator: 58 s
- ComputeTransportLen..: 75 s
- NavFindNextBound..: 71 s
- GetSafety: 15.3 s
- NewSimpleNav: 28 s
- HybridNav: 25 s
- SimpleABBox: 1.8 s

VectorNav, NewSimpleNav chunk: 26.3 s
- Self: 8 s
- LocateGlobalPoint: 5.4 s
- PolyconeImpl DistanceToOut: 2.4 s
- PolyconeImpl DistanceToIn: 1.7 s
- Polyhedron DistanceToOut: 1.6 s
- TubeImpl DistanceToOut: 1.0 s

NewSimpleNav *without* vectorization: 30.35 s
- Self: 8.7 s
- LocateGlobalPoint: 7.3 s


Thank you!

BACKUPS