Scheduler overview status & issues

Scheduler overview status & issues
A. Gheata, GeantV CERN meeting, Mar 16, 2015

Outlook
- Data structures and memory
- New vector balancing policy
- New basket management policy
- New prioritization policy
- Issues

GeantTrack_v
[Diagram: GeantTrack_v stores tracks as a structure of arrays (SOA) in a contiguous buffer. The scalar GeantTrack fields (fEvent, fEvslot, fParticle, fPDG, ..., fXpos, fYpos, fZpos, fXdir, fYdir, fZdir, fEdep, fPstep, fSnext, fSafety) become per-field arrays (fEventV, fEvslotV, ..., fSafetyV), each padded to a 32-byte boundary; the geometry states fPath and fNextpath become the fPathV and fNextpathV arrays in a second buffer. Per track: 192 bytes + 2*sizeof(VolumePath_t), with sizeof(VolumePath_t) = 136 + 8*max_geom_depth; sizeof(GeantTrack_v) = 384 + fNtracks*sizeof(VolumePath_t).]
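The SOA idea can be pictured with a minimal C++ sketch (reduced field set, hypothetical names, not the actual GeantV code): a single allocation is sliced into one array per field, so each field is contiguous and SIMD-friendly.

    #include <cstddef>
    #include <cstdlib>

    // Reduced-field sketch of the GeantTrack_v layout idea: one
    // contiguous buffer sliced into per-field arrays.
    struct TrackSoA {
      int     fCapacity;
      char   *fBuf;      // the single contiguous allocation
      double *fXposV;    // was GeantTrack::fXpos
      double *fXdirV;    // was GeantTrack::fXdir
      int    *fEventV;   // was GeantTrack::fEvent

      explicit TrackSoA(int capacity) : fCapacity(capacity) {
        // Doubles first so every array stays naturally aligned; the real
        // code instead pads each array to a 32-byte boundary for SIMD.
        std::size_t bytes = capacity * (2 * sizeof(double) + sizeof(int));
        fBuf = static_cast<char *>(std::malloc(bytes));
        char *p = fBuf;
        fXposV  = reinterpret_cast<double *>(p); p += capacity * sizeof(double);
        fXdirV  = reinterpret_cast<double *>(p); p += capacity * sizeof(double);
        fEventV = reinterpret_cast<int *>(p);
      }
      ~TrackSoA() { std::free(fBuf); }
    };

A loop over fXposV then touches consecutive memory, which is what makes vectorized transport of a basket worthwhile.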

Size considerations
- The size of a GeantTrack_v depends on its track capacity and on the maximum geometry depth (15 for CMS):
  (capacity 256, depth 15) -> size = 176 kBytes
  (capacity 16, depth 15) -> size = 22 kBytes
- CMS at a capacity of 256 tracks: 4173 basket managers x 2 baskets x 2 track arrays per basket = 4173*4*176 kB = 2.85 GBytes (with no baskets transported!!!)
- For 16 tracks per basket: 358 MBytes
- Pending baskets can be cut off on a memory threshold
- A smarter basket policy has been implemented
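As a back-of-the-envelope check, using the per-array sizes quoted above (the small difference from the 2.85 GB on the slide comes from rounding of the 176 kB figure):

    #include <cstdio>

    int main() {
      // 4173 basket managers, 2 baskets each, 2 GeantTrack_v arrays per basket
      const long arrays = 4173L * 2 * 2;
      std::printf("capacity 256: %.2f GB\n", arrays * 176.0 / (1024 * 1024));
      std::printf("capacity 16:  %.1f MB\n", arrays * 22.0 / 1024);
      return 0;
    }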

Memory threshold

// Maximum user memory limit [MB]
propagator->fMaxRes = 2000;

GeantBasket
- The elementary work unit for GeantV
- Currently holds only tracks that are physically inside a given logical volume
- Evolution: filter tracks by different criteria, providing locality for any processing stage (when it is worthwhile)
- Input GeantTrack_v array, filled by the scheduler
- Output GeantTrack_v array, filled during transport
- Baskets have thread-local access during transport, but concurrent access during scheduling (!)
- Recycled after re-basketizing to their owner basket managers
- Mixed baskets contain tracks from different volumes; everything is called with the scalar interface, to avoid overheads when vectorization is difficult or penalizing
[Diagram: the scheduler fills input baskets via AddTrack during multithreaded re-basketizing; transport (single thread) and physics fill the output, which feeds back into re-basketizing.]
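A minimal sketch of the concept (hypothetical names, std::vector instead of GeantTrack_v, and no synchronization; the real AddTrack must tolerate concurrent callers during scheduling):

    #include <vector>

    struct Track { int fEvent; double fXpos; /* ... */ };

    struct Basket {
      int fVolume;                // logical volume this basket collects for
      std::vector<Track> fInput;  // filled by the scheduler (concurrently!)
      std::vector<Track> fOutput; // filled thread-locally during transport

      void AddTrack(const Track &t) { fInput.push_back(t); }

      // After transport the basket is recycled to its owner basket manager.
      void Recycle() { fInput.clear(); fOutput.clear(); }
    };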

Basket managers
- One per logical volume (4173 in cms2015)
- 2 baskets each (4 GeantTrack_v arrays):
  current: normal scheduling operations
  priority: prioritized events
- 1 queue for concurrent replacement and recycling
- Dynamic track content threshold per basket before pushing to the work queue: low basket flow -> small baskets, high flow -> large baskets
- At any moment at least 2 + Nqueued baskets are instantiated and held by each basket manager, plus a variable number of detached baskets being processed
[Diagram: a TGeoVolume owns a basket manager with its current and priority baskets and a bounded queue of recycled baskets.]
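In sketch form (simplified and single-threaded; the real replacement and recycling queue is concurrent and bounded):

    #include <queue>

    struct Basket { /* as sketched on the GeantBasket slide */ };

    struct BasketMgr {
      Basket *fCurrent   = nullptr; // normal scheduling operations
      Basket *fPriority  = nullptr; // prioritized events
      int     fThreshold = 16;      // dynamic track count before dispatch
      std::queue<Basket *> fRecycled;

      // When fCurrent reaches fThreshold tracks it is swapped for a
      // recycled (or freshly allocated) basket.
      Basket *Replace() {
        Basket *next;
        if (fRecycled.empty()) { next = new Basket(); }
        else { next = fRecycled.front(); fRecycled.pop(); }
        Basket *full = fCurrent;
        fCurrent = next;
        return full; // caller pushes this to the transport queue
      }
    };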

Vector size balancing
- Per volume type: Ntotal = Nused + Nrecycle
- Concurrent track addition, garbage collection, and collection of tracks from prioritized events
- Adjustable threshold fThreshold, aiming for Nused = Nthreads
[Diagram: the GeantScheduler feeds each volume's current basket manager; baskets reaching fThreshold are pushed to the transport queue.]
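One plausible reading of the adjustment rule, as a sketch (hypothetical names, illustrative bounds): grow baskets when more of them are in use than there are worker threads (high flow), shrink them when fewer are (low flow).

    #include <algorithm>

    int AdjustThreshold(int threshold, int nUsed, int nThreads,
                        int minSize = 4, int maxSize = 256) {
      if (nUsed > nThreads) return std::min(2 * threshold, maxSize); // high flow
      if (nUsed < nThreads) return std::max(threshold / 2, minSize); // low flow
      return threshold; // Nused == Nthreads: the target operating point
    }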

Monitoring vectorization

=== Thread 1: exiting ===
=== Thread 3: exiting ===
=== Percent of tracks transported in single track mode: 23.9395%

New basket allocation policy
- Monitor the distribution of steps in volumes, after an initial total number of steps
- Sort volumes by activity and sum up the bins to a threshold (e.g. 90% of the total steps)
- Activate basket managers only for those volumes, which represent a small fraction of the total
- Redo after every 4x the previous number of steps
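The selection step might look like this sketch (illustrative names):

    #include <algorithm>
    #include <vector>

    struct VolStats { const char *fName; long fNsteps; };

    // Sort volumes by step count and keep the most active ones until the
    // accumulated share of all steps reaches `fraction` (e.g. 0.9).
    std::vector<VolStats> SelectActive(std::vector<VolStats> stats,
                                       double fraction) {
      long total = 0;
      for (const VolStats &v : stats) total += v.fNsteps;
      std::sort(stats.begin(), stats.end(),
                [](const VolStats &a, const VolStats &b) {
                  return a.fNsteps > b.fNsteps;
                });
      std::vector<VolStats> active;
      long sum = 0;
      for (const VolStats &v : stats) {
        if (sum >= fraction * total) break;
        active.push_back(v);
        sum += v.fNsteps;
      }
      return active; // only these volumes get basket managers
    }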

New basket allocation policy

=== Learning phase of 4000000 steps completed ===
Activated 528 volumes accounting for 90.0% of track steps
* FixedShield102880: 708955 steps
* HVQX8780: 462821 steps
* ZDC_EMLayer9b00: 83838 steps
* BeamTube22b780: 78748 steps
* OQUA6780: 62597 steps
* QuadInner3300: 56376 steps
* ZDC_EMAbsorber9d00: 53672 steps
* QuadOuter3700: 52155 steps
* QuadCoil3680: 49086 steps
* ZDC_EMFiber9e80: 41705 steps

New priority policy
- Monitor the distribution of tracks per event
- Start prioritizing an event when its number of tracks in flight drops below a fraction of the maximum it reached (e.g. 1%)
- The remaining tracks are collected by mixed baskets, one mixed basket per thread
- Cutting short event tails: no need for one priority basket per volume, less basket fragmentation, and somewhat better concurrency
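A sketch of the trigger condition (illustrative names; the 1% fraction is the example from the slide):

    // Track per-event statistics and flip the event to priority mode once
    // its in-flight track count falls below a fraction of its own maximum.
    struct EventSlot {
      int  fInFlight    = 0;
      int  fMaxInFlight = 0;
      bool fPriority    = false;
    };

    void UpdatePriority(EventSlot &ev, double fraction /* e.g. 0.01 */) {
      if (ev.fInFlight > ev.fMaxInFlight) ev.fMaxInFlight = ev.fInFlight;
      if (!ev.fPriority && ev.fMaxInFlight > 0 &&
          ev.fInFlight < fraction * ev.fMaxInFlight)
        ev.fPriority = true; // remaining tracks go to the mixed baskets
    }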

Queue with new policy

...
Imported 340 tracks from events 10 to 10. Dispatched 20 baskets.
### Event 0 prioritized at 1 % threshold
Event 0: 281480 tracks transported, max in flight 5187
= digitizing event 0 with 0 tracks
=> Importing event 11

Threads
- WorkloadManager::MainScheduler (1): runs as a separate thread; does queue monitoring and triggers actions
- WorkloadManager::GarbageCollectorThread (1): woken up by MainScheduler every 50 iterations; garbage-collects the pending baskets of every basket manager
- WorkloadManager::MonitoringThread (1): runs as a background thread, activating histograms on demand (work queue, memory, number of baskets per volume, concurrency, number of tracks in flight per event)
- WorkloadManager::TransportTracks (Nworkers): the main basket transport method; also calls the re-basketizer (GeantScheduler::AddTracks)
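Putting the layout together, a sketch with std::thread and empty placeholder bodies (the real methods live on WorkloadManager; the free functions here are stand-ins):

    #include <thread>
    #include <vector>

    void MainScheduler()          { /* queue monitoring, triggers actions */ }
    void GarbageCollectorThread() { /* woken every 50 iterations */ }
    void MonitoringThread()       { /* histograms on demand */ }
    void TransportTracks(int id)  { /* transport + re-basketizing */ (void)id; }

    void LaunchAll(int nWorkers) {
      std::vector<std::thread> pool;
      pool.emplace_back(MainScheduler);          // 1 scheduler thread
      pool.emplace_back(GarbageCollectorThread); // 1 garbage collector
      pool.emplace_back(MonitoringThread);       // 1 monitor
      for (int i = 0; i < nWorkers; ++i)         // Nworkers transport threads
        pool.emplace_back(TransportTracks, i);
      for (std::thread &t : pool) t.join();
    }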

Issues
- Basket management is too memory hungry
  The pre-allocate-everything policy is not good for CMS: we know that ~10% of the volumes take >60% of the transport
  Implement a policy to create baskets only for these volumes - DONE
- The priority algorithm produces too fragmented baskets and takes too long to flush events
  Preempt the start of the priority regime, independent of the queue status - DONE
  E.g. when the number of tracks in flight for one event is less than 5% of the maximum ever reached (cutting tails)
  Keep only prioritized tracks in the same mixed basket and reuse baskets without re-basketizing when the population is low

Issues (2)
- Contention on re-basketizing is high (especially with fat baskets)
  Read-from-many, write-to-one policy; highly optimized using atomics, but still... Amdahl watches
- Overheads in concurrent queues are non-negligible; currently scaling to ~6 threads
  Can be changed to reading the same basket concurrently, but writing to thread-local ones (sketched below)
  Packaging events per group of threads is always possible
- Concurrency in the new scheduling approach is to be closely monitored (VTune)
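The proposed change could look like this sketch: an atomic cursor lets many workers read the same input basket, while each writes to a thread-local output, so the write side needs no locking (illustrative only):

    #include <atomic>
    #include <vector>

    struct TrackRef { int fIndex; };

    struct SharedInput {
      std::atomic<int> fNext{0};
      int fNtracks = 0;

      // Each worker claims the next unprocessed track index.
      bool Claim(int &i) {
        i = fNext.fetch_add(1, std::memory_order_relaxed);
        return i < fNtracks;
      }
    };

    void Worker(SharedInput &in, std::vector<TrackRef> &localOut) {
      int i;
      while (in.Claim(i))
        localOut.push_back({i}); // thread-local write: no contention
    }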