Detailed evolution of performance metrics: Folding
Judit Gimenez, Petascale Workshop 2013

Our Tools
– Since 1991. Based on traces. Open source.
– Core tools:
  – Paraver (paramedir): offline trace analysis
  – Dimemas: message-passing simulator
  – Extrae: instrumentation
– Performance analytics:
  – Detail, flexibility, intelligence
  – Behaviour vs. syntactic structure

What is good performance?
Performance of a sequential region = 2000 MIPS. Is it good enough? Is it easy to improve?
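As a back-of-the-envelope check (hypothetical numbers, not from the slides): a 2.5 GHz core able to retire up to 4 instructions per cycle peaks at 10 000 MIPS, so 2000 MIPS corresponds to 0.8 instructions per cycle, a fifth of peak. The raw number alone cannot answer the question; it has to be related to what the code actually does, which motivates the detailed views that follow.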

What is good performance?
MR. GENESIS: interchanging loops [Figure comparing performance before and after the loop interchange]

Can I get very detailed performance data with low overhead?
– Application granularity vs. detailed granularity
  – Samples: hardware counters + call stack
– Folding, based on known structure (iterations, routines, clusters):
  – Project all samples into one instance
– Extremely detailed time evolution of hardware counts, rates and call stack with minimal overhead
  – Correlate many counters
  – Instantaneous CPI stack models
Unveiling Internal Evolution of Parallel Application Computation Phases (ICPP 2011)

Mixing instrumentation and sampling
– Benefit from applications' repetitiveness
– Different roles:
  – Instrumentation delimits regions
  – Sampling reports progress within a region
[Figure: samples from Iteration #1, Iteration #2 and Iteration #3 projected onto one Synthetic Iteration]
Unveiling Internal Evolution of Parallel Application Computation Phases (ICPP 2011)
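A minimal sketch of the projection step, assuming a simplified trace layout (the fold function and its field names are my illustration, not the Extrae/Paraver format): instrumentation provides each instance's start and end timestamps, and every sample inside an instance is mapped to a normalized position within one synthetic instance.

    # Folding sketch (illustration only, not the BSC tools' code).
    # Each instance spans [start, end]; each sample inside it records the
    # hardware-counter value accumulated since the instance started.
    def fold(instances):
        """Project samples from all instances onto one synthetic instance.

        instances: list of dicts with 'start', 'end' (timestamps) and
        'samples', a list of (timestamp, counter_since_start) pairs.
        Returns (normalized_time, counter) points ordered within [0, 1].
        """
        folded = []
        for inst in instances:
            span = inst["end"] - inst["start"]
            for t, counter in inst["samples"]:
                folded.append(((t - inst["start"]) / span, counter))
        return sorted(folded)

With only a few samples per iteration but many iterations, the synthetic instance becomes densely populated, which is why a low sampling frequency (and hence low overhead) is enough.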

Folding hardware counters
Instructions evolution for routine copy_faces of NAS MPI BT.B:
– Red crosses are the folded samples: completed instructions since the start of the routine.
– The green line is a curve fitted to the folded samples; it is used to reintroduce the values into the tracefile.
– The blue line is the derivative of the fitted curve over time (the counter rate).
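The fitting step can be sketched with a plain polynomial fit; this is a simplification of what the folding tool actually does (its interpolation scheme is more robust), so treat the choice of numpy.polyfit and the degree as illustrative assumptions.

    import numpy as np

    def fit_and_rate(norm_time, counts, duration_s, degree=6):
        """Fit folded samples (green line) and derive the rate (blue line).

        norm_time:  normalized sample times in [0, 1]
        counts:     counter values accumulated since the region started
        duration_s: duration of one instance of the region, in seconds
        """
        curve = np.poly1d(np.polyfit(norm_time, counts, degree))
        rate = curve.deriv()  # counts per unit of normalized time
        x = np.linspace(0.0, 1.0, 200)
        # dividing by the instance duration converts the rate to counts/s
        return x, curve(x), rate(x) / duration_s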

Folding hardware counters with call stack
[Figure: folded instructions together with the folded source-code line over time]

Folding hardware counters with call stack (CUBE)

Using clustering to identify structure
[Figure: scatter plot of computation bursts (duration vs. counter values) separated into clusters]
Automatic Detection of Parallel Applications Computation Phases (IPDPS 2009)
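The burst clustering can be approximated with off-the-shelf density-based clustering; the sketch below uses scikit-learn's DBSCAN with an illustrative feature choice and parameters (the IPDPS 2009 paper defines the actual method and features).

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler

    def cluster_bursts(bursts):
        """Group computation bursts by similar behaviour.

        bursts: array of shape (n, 2), e.g. (completed instructions, IPC)
        per burst. Returns one cluster label per burst; -1 marks noise.
        """
        X = StandardScaler().fit_transform(np.log1p(bursts))
        return DBSCAN(eps=0.3, min_samples=10).fit_predict(X)

Density-based clustering needs no a-priori number of clusters and tolerates noise bursts, which suits the irregular scatter plots that real traces produce.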

Example 1: PEPC
[Figure: folded MIPS timeline; region A at 96 MIPS]
Performance metrics (Region A): 16 MIPS, 2.3 M L2 misses/s, 0.1 M TLB misses/s

Original code (each whole-array component assignment traverses htable separately):

    htable%node = 0
    htable%key = 0
    htable%link = -1
    htable%leaves = 0
    htable%childcode = 0

Modified code (the five assignments fused into a single traversal, touching each element once):

    do i = 1, n
      htable(i)%node = 0
      htable(i)%key = 0
      htable(i)%link = -1
      htable(i)%leaves = 0
      htable(i)%childcode = 0
    end do

Changes: -70% time, -18% instructions, -63% L2 misses, -78% TLB misses; 253 MIPS (+163%)

Example 1: PEPC
[Figure: folded MIPS timeline; region B at 403 MIPS, region A as before]

Performance metrics:
                  Region A    Region B
    MIPS          100         80
    L2 misses/s   4 M         2 M
    TLB misses/s  0.4 M       1 M

Example 1: PEPC
Changes (Region A): -70% time, -18% instructions, -63% L2 misses, -78% TLB misses; 253 MIPS (+163%)
Changes (Region B): -30% time, -1% instructions, -10% L2 misses, -32% TLB misses; 544 MIPS (+34%)

Example 2: CG-POP with CPI stack
– Folded lines: interpolation → statistical profile
– Points to "small" regions, labelled A to D, in pcg_chrongear_linear and matvec
[Figure: folded source-line profile (line number over time) with regions A-D marked]

    iter_loop: do m = 1, solv_max_iters
       sumN1 = c0
       sumN3 = c0
       do i = 1, nActive
          Z(i) = Minv2(i)*R(i)
          sumN1 = sumN1 + R(i)*Z(i)
          sumN3 = sumN3 + R(i)*R(i)
       enddo
       do i = iptrHalo, n
          Z(i) = Minv2(i)*R(i)
       enddo
       call matvec(n, A, AZ, Z)
       sumN2 = c0
       do i = 1, nActive
          sumN2 = sumN2 + AZ(i)*Z(i)
       enddo
       call update_halo(AZ)
       ...
       do i = 1, n
          stmp = Z(i) + cg_beta*S(i)
          qtmp = AZ(i) + cg_beta*Q(i)
          X(i) = X(i) + cg_alpha*stmp
          R(i) = R(i) - cg_alpha*qtmp
          S(i) = stmp
          Q(i) = qtmp
       enddo
    end do iter_loop

Framework for a Productive Performance Optimization (PARCO Journal 2013)

Example 2: CG-POP
Modified code: the first Z computation and the sumN1/sumN3 reductions are peeled out of the iteration loop, matvec is fused with the sumN2 reduction (matvec_r), and the final update loop absorbs the Z computation and the reductions for the next iteration:

    sumN1 = c0
    sumN3 = c0
    do i = 1, nActive
       Z(i) = Minv2(i)*R(i)
       sumN1 = sumN1 + R(i)*Z(i)
       sumN3 = sumN3 + R(i)*R(i)
    enddo
    do i = iptrHalo, n
       Z(i) = Minv2(i)*R(i)
    enddo
    iter_loop: do m = 1, solv_max_iters
       sumN2 = c0
       call matvec_r(n, A, AZ, Z, nActive, sumN2)
       call update_halo(AZ)
       ...
       sumN1 = c0
       sumN3 = c0
       do i = 1, n
          stmp = Z(i) + cg_beta*S(i)
          qtmp = AZ(i) + cg_beta*Q(i)
          X(i) = X(i) + cg_alpha*stmp
          R(i) = R(i) - cg_alpha*qtmp
          S(i) = stmp
          Q(i) = qtmp
          Z(i) = Minv2(i)*R(i)
          if (i <= nActive) then
             sumN1 = sumN1 + R(i)*Z(i)
             sumN3 = sumN3 + R(i)*R(i)
          endif
       enddo
    end do iter_loop

(The original code is the version shown on the previous slide.)

Example 2: CG-POP
[Figure: folded profiles before and after the restructuring, regions A-D]
11% improvement on an already optimized code.

Example 3: CESM

4 cycles in Cluster 1 [Figure: folded evolution with routine groups A, B and C marked]
– Group A: conden 2.7%, compute_uwshcu 3.3%, rtrnmc 1.75%
– Group B: micro_mg_tend 1.36% (1.73%), wetdepa_v2 2.5%
– Group C: reftra_sw 1.71%, spcvmc_sw 1.21%, vrtqdr_sw 1.43%

Example 3: CESM
wetdepa_v2 consists of a double nested loop:
– Very long (~400 lines)
– Unnecessary branches which inhibit vectorization
Restructuring wetdepa_v2:
– Break up the long loop to simplify vectorization
– Promote scalars to vector temporaries
– Common-subexpression elimination
CESM B-case, NE=16, 570 cores, Yellowstone, Intel 13.1.1, -O2
[Table: % total time, duration (ms) and improvement factor for the original and modified versions; numeric values not recovered]

Energy (Sandy Bridge)
3 energy domains:
– Processor die (Package)
– Cores (PP0)
– Attached RAM (optional, DRAM)
In comparison with performance counters:
– Per-processor-die information
– Time discretization
  – Measured at 1 kHz → no control on boundaries (e.g. cannot separate MPI from computation)
– Power quantization
  – Energy reported in multiples of 15.3 µJ
Folding energy counters:
– Noise values
  – Discretization: consider a uniform distribution?
  – Quantization: select the latest valid measure?
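For reference, a minimal sketch of reading the package energy domain on Linux through the powercap sysfs interface; this path and method are my assumption for illustration, not how the slides collect the counters (the tools read them through their own counter infrastructure).

    import time

    RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package domain

    def average_power_watts(interval_s=0.1, path=RAPL):
        """Read the cumulative RAPL energy counter twice; return mean power.

        The counter only advances in fixed energy units (about 15.3 uJ on
        Sandy Bridge), which is the quantization mentioned above. Counter
        wraparound is ignored here for brevity.
        """
        with open(path) as f:
            e0 = int(f.read())
        time.sleep(interval_s)
        with open(path) as f:
            e1 = int(f.read())
        return (e1 - e0) / 1e6 / interval_s  # uJ -> J; J/s = W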

Folding energy counters in serial benchmarks
[Figure: folded MIPS and Core, DRAM and Package power against the TDP for FT.B, LU.B, 444.namd, 481.wrf, 437.leslie3d, 435.gromacs, BT.B and Stream]

HydroC analysis
HydroC, 8 MPI processes
– Intel® Xeon® 2.60 GHz (2 x octo-core nodes)
[Figure: folded results at 1, 2, 4 and 8 pps]

MrGenesis analysis
MrGenesis, 8 MPI processes
– Intel® Xeon® 2.60 GHz (2 x octo-core nodes)
[Figure: folded results at 1, 2, 4 and 8 pps]

Conclusions
– Performance answers are in detailed and precise analysis
– Analysis: [temporal] behaviour vs. syntactic structure