1
Benefits of sampling in tracefiles
Harald Servat
Program Development for Extreme-Scale Computing
May 3rd, 2010
2
Outline
  Instrumentation and sampling
  Folding
  Summarized traces
  Some results
  Current work
3
Instrumentation
  Performance tools based on instrumentation
  Granularity of the results depends on the application structure
  Data gathered includes: performance counters, callstack, message size…
4
Sampling
  Sampling reaches any point of the application at a periodic interval
  Easily tunable frequency
  Gathers performance counters and callstack
5
Main objective
  Combine both mechanisms
    Deeper performance details
  Using PAPI_overflow(..)... what about the frequency trade-off?
    Not so high that it disrupts the performance data
    Not so low that it yields no useful information
6
Work done: Folding
  Harald Servat, Germán Llort, Judit Giménez, Jesús Labarta: Detailed performance analysis using coarse grain sampling. PROPER, 2009.
  Objective: get detailed metrics with few samples
    Benefits from both high and low frequencies!
  Take advantage of the stationary behavior of scientific applications
  Build a synthetic region from scattered samples
  Reintroduce into the tracefile at a chosen ratio
7
Folding: Moving samples
  Main idea: move samples to the target iteration, preserving their original relative time.
  Steps
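A minimal sketch of the main idea above, assuming the iteration boundaries are already known from instrumentation. The function name `fold_samples`, its parameters, and the choice to rescale by each instance's own duration are illustrative assumptions, not the tool's actual implementation.

```python
from bisect import bisect_right

def fold_samples(iteration_starts, end_time, samples, target=0):
    """Move every sample into the target iteration, keeping its relative time.

    iteration_starts: sorted start times of each instance of the region.
    end_time: end time of the last instance.
    samples: (timestamp, value) pairs gathered anywhere in the run.
    Returns (folded_timestamp, value) pairs sorted by folded timestamp.
    """
    bounds = list(iteration_starts) + [end_time]
    t0 = bounds[target]
    target_len = bounds[target + 1] - t0
    folded = []
    for ts, value in samples:
        i = bisect_right(bounds, ts) - 1       # instance containing the sample
        if i < 0 or i >= len(bounds) - 1:
            continue                           # sample falls outside every instance
        begin, end = bounds[i], bounds[i + 1]
        if end == begin:
            continue                           # skip zero-length instances
        rel = (ts - begin) / (end - begin)     # relative position in [0, 1)
        folded.append((t0 + rel * target_len, value))
    folded.sort(key=lambda pair: pair[0])
    return folded
```

For counter values to be comparable after folding, each value would need to be expressed relative to the counter reading at the start of its own instance; the sketch leaves that subtraction to the caller.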
8
Folding: Interpolation
  Evolution of completed instructions for routine copy_faces of NAS MPI BT, class B
  There are no instrumentation points within the routine, but we still obtain details
  Red crosses represent the folded samples and show the instructions completed since the start of the routine
  The green line is the curve fitted to the folded samples; it is used to reintroduce the values into the tracefile
  The blue line is the derivative of the fitted curve
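The relation between the fitted curve and its derivative can be sketched as follows. The slide does not say which fitting method is used, so this example swaps in a deliberately simple stand-in: it averages the folded samples in equal-width time bins and takes finite differences between bin centers. The name `fit_and_differentiate` and the `bins` parameter are assumptions made for illustration.

```python
def fit_and_differentiate(folded, duration, bins=10):
    """Approximate the accumulated-counter curve and its rate of change.

    folded: (relative_time, counter_value) pairs with 0 <= time <= duration,
            where counter_value is counted from the start of the region.
    Returns (centers, fitted, rate): bin centers, mean counter value per
    non-empty bin, and the slope between consecutive non-empty bins.
    """
    width = duration / bins
    sums = [0.0] * bins
    counts = [0] * bins
    for t, v in folded:
        b = min(int(t / width), bins - 1)   # clamp t == duration into the last bin
        sums[b] += v
        counts[b] += 1

    centers, fitted = [], []
    for b in range(bins):
        if counts[b]:                       # empty bins are skipped, not guessed
            centers.append((b + 0.5) * width)
            fitted.append(sums[b] / counts[b])

    # Finite differences: counter units per time unit (e.g. instructions/s).
    rate = [
        (fitted[i + 1] - fitted[i]) / (centers[i + 1] - centers[i])
        for i in range(len(fitted) - 1)
    ]
    return centers, fitted, rate
```

The accumulated curve plays the role of the green line and `rate` the role of the blue line; a smoother fit would replace the binning step without changing that relation.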
9
Folding areas
  Folding is applied to delimited regions
    Previously instrumented
      User function
      Iteration
    Automatically obtained from the gathered results
      Clusters of computation bursts
        Juan González, Judit Giménez, Jesús Labarta: Automatic detection of parallel applications computation phases, IPDPS 2009
      Delimited time regions
        Marc Casas, Rosa M. Badia, Jesús Labarta: Automatic Structure Extraction from MPI Applications Tracefiles, Euro-Par 2007
10
Impact of the sampling frequency
  The more samples are folded, the more detailed the results
    Longer executions
    Increase the frequency
  Reach stability?
  Example: NAS BT class B, copy_faces, shown for 10 to 200 iterations
    20 samples per second on an SGI Altix
11
Impact of the sampling frequency
  Choosing a sampling frequency is important
  Sampling frequency can couple with application frequency
  Choose frequencies based on prime factors
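A small illustration of the coupling problem, under the simplifying assumption that both the iteration and the sampling period are constant and expressed as integers in the same time unit. The periods used in the example call are hypothetical and do not come from the slides.

```python
from math import gcd

def distinct_phases(iteration_period, sampling_period, n_samples):
    """Count how many different positions within the iteration get sampled.

    Sample k fires at time k * sampling_period; its position inside the
    iteration is that time modulo iteration_period.
    """
    return len({(k * sampling_period) % iteration_period for k in range(n_samples)})

# Hypothetical periods in milliseconds: a 100 ms iteration sampled every
# 50 ms (coupled) versus every 53 ms (53 is prime and shares no factor with 100).
coupled = distinct_phases(100, 50, 1000)
decoupled = distinct_phases(100, 53, 1000)
print(coupled, decoupled)
print(100 // gcd(100, 50), 100 // gcd(100, 53))  # closed form for long runs
```

When the two periods share a large common factor, folding keeps landing on the same few points of the iteration however long the run is. Periods with no common factor spread the folded samples over the whole region.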
12
Outline
  Instrumentation and sampling
  Folding
  Summarized traces
  Some results
  Current work
13
Dealing with large scale traces
  Jesús Labarta, Judit Giménez, Eloy Martínez, Pedro González, Harald Servat, Germán Llort, Xavier Aguilar: Scalability of tracing and visualization tools, PARCO 2005.
  An application's behavior can be divided into:
    Communication phases
    Intensive computation phases
  Instrumentation library that identifies relevant computation phases
14
Dealing with large scale traces
  Information emitted at phase change
    Punctual (callstack)
    Aggregated
      Hardware counters
      Software counters
        Number of point-to-point and collective operations
        Number of bytes transferred
        Time in MPI
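The aggregation described above can be sketched as a single pass over a per-task event stream. The event tuple layout, the `min_burst` threshold, and the record names are hypothetical choices made for this example; hardware counters and callstacks are left out to keep it short.

```python
def summarize(events, min_burst):
    """Collapse a detailed event stream into a summarized trace.

    events: time-ordered tuples, either
        ("compute", start, end) or
        ("mpi", start, end, kind, nbytes) with kind in {"p2p", "collective"}.
    min_burst: shortest computation burst considered relevant.
    Returns a list of ("burst", start, end) and ("summary", counters) records.
    """
    def fresh():
        return {"p2p": 0, "collective": 0, "bytes": 0, "mpi_time": 0}

    records = []
    counters = fresh()

    def flush():
        # Emit the aggregated software counters at a phase change.
        nonlocal counters
        if any(counters.values()):
            records.append(("summary", counters))
            counters = fresh()

    for ev in events:
        if ev[0] == "compute":
            _, start, end = ev
            if end - start >= min_burst:
                flush()                       # communication phase ends here
                records.append(("burst", start, end))
            # shorter bursts are dropped from the summarized trace
        else:
            _, start, end, kind, nbytes = ev
            counters[kind] += 1
            counters["bytes"] += nbytes
            counters["mpi_time"] += end - start
    flush()
    return records
```

Only relevant bursts and one counter record per communication phase survive, so the output grows with the number of phases rather than with the number of MPI calls.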
15
Example
  PEPC, 16384 tasks on Jaguar
  Duration of the computation bursts
  Number of MPI collective operations
16
Benefits of summarized tracefiles
  Important trace size reduction
    Gadget2 (128 tasks): 10 Gbytes down to 428 Mbytes
    PEPC (16k tasks): 19 Gbytes down to 400 Mbytes
    PFLOTRAN (16k tasks): over 250 Gbytes down to 6 Gbytes
  Whole execution analysis
17
Working with large traces?
  We're dealing with large scale executions
  Maintain scalability of tracing + sampling
    By adding more data?
  Use folding to reduce data
  Example (Gadget2 using 128 tasks)
    100 iterations, 5 samples/s during 90 minutes: ~236 MB
    Folding onto 1 iteration at 200 samples/s: ~64 MB
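A back-of-the-envelope check of the Gadget2 numbers above. It assumes, for illustration only, that all 100 iterations have the same length and that the 90 minutes cover just those iterations; the function name is hypothetical.

```python
def folded_density(sample_rate_hz, run_seconds, iterations):
    """Return (samples per task, seconds per iteration, samples/s after folding)."""
    samples_per_task = sample_rate_hz * run_seconds
    iteration_seconds = run_seconds / iterations
    # Folding piles every sample of the run onto a single iteration.
    return samples_per_task, iteration_seconds, samples_per_task / iteration_seconds

samples, iter_len, density = folded_density(5, 90 * 60, 100)
print(samples, iter_len, density)   # samples per task, s per iteration, folded samples/s
print(236 / 64)                     # size ratio between the two traces on the slide
```

Under these assumptions, coarse sampling across the whole run would supply more folded samples per second of the synthetic iteration than the 200 samples/s that are reintroduced, and the folded trace is smaller than the sampled one by the printed ratio.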
18
Outline
  Instrumentation and sampling
  Folding
  Summarized traces
  Combining mechanisms
  Some results
  Current work
19
Gadget2 analysis, 128 tasks
  [Figure: four folded regions. Percentages shown: 32%, 16%, 13%, 8%]
  Code ranges shown (file +line):
    force_tree.c +75 - gravity_tree.c +167
    gravity_tree.c +528 - density.c +167
    force_tree.c +1701 - hydra.c +246
    predict.c +92 - pm_periodic.c +385
20
PEPC analysis, 32 tasks
  [Figure: four folded regions. Percentages shown: 45%, 37%, 5%, 3%]
  Code ranges shown (file +line):
    tree_aswalk.f90 +162 - tree_aswalk.f90 +380
    tree_domains.f90 +548 - tree_branches.f90 +155
    tree_branches.f90 +548 - tree_properties.f90 +328
    tree_aswalk.f90 +380 - tree_aswalk.f90 +162
21
Current directions
  We work on:
    Is there an optimal sampling frequency?
    Quantify correctness and validate the results
    Callstack analysis
22
Thank you!