High Performance Computing: Concepts, Methods & Means Performance I: Benchmarking Prof. Thomas Sterling Department of Computer Science Louisiana State University January 23rd, 2007

2 Topics Definitions, properties and applications Early benchmarks Everything you ever wanted to know about Linpack (but were afraid to ask) Other parallel benchmarks Organized benchmarking Presentation and interpretation of results Summary

3 Definitions, properties and applications Early benchmarks Linpack Other parallel benchmarks Organized benchmarking Presentation and interpretation of results Summary

4 Basic Performance Metrics Time related: –Execution time [seconds] wall clock time system and user time –Latency –Response time Rate related: –Rate of computation floating point operations per second [flops] integer operations per second [ops] –Data transfer (I/O) rate [bytes/second] Effectiveness: –Efficiency [%] –Memory consumption [bytes] –Productivity [utility/($*second)] Modifiers: –Sustained –Peak –Theoretical peak
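To make the time- and rate-related metrics above concrete, the following minimal C sketch (our illustration, not part of the original slides) times a simple vector kernel with both a wall-clock and a CPU-time clock and derives a sustained Mflops figure; the kernel, array size and operation count are arbitrary example choices.

/* Minimal sketch: wall-clock vs. CPU time around a kernel, and the flop rate
   derived from the known operation count of that kernel. */
#include <stdio.h>
#include <time.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = 1.5; c[i] = 2.5; }

    struct timespec t0, t1;
    clock_t c0 = clock();                      /* user + system CPU time      */
    clock_gettime(CLOCK_MONOTONIC, &t0);       /* wall-clock time             */

    for (int i = 0; i < N; i++)                /* 2 flops per iteration       */
        a[i] = b[i] * 3.0 + c[i];

    clock_gettime(CLOCK_MONOTONIC, &t1);
    clock_t c1 = clock();

    double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double cpu  = (double)(c1 - c0) / CLOCKS_PER_SEC;
    printf("wall = %.6f s, cpu = %.6f s, rate = %.1f Mflops (a[N/2]=%g)\n",
           wall, cpu, 2.0 * N / wall / 1e6, a[N/2]);
    return 0;
}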

5 What Is a Benchmark? Benchmark: a standardized problem or test that serves as a basis for evaluation or comparison (as of computer system performance) [Merriam-Webster] The term “benchmark” also commonly applies to specially designed programs used in benchmarking A benchmark should: –be domain specific (the more general the benchmark, the less useful it is for anything in particular) –be a distillation of the essential attributes of a workload –avoid using a single metric to express the overall performance Computational benchmark kinds: –synthetic: specially created programs that impose a load on a specific component of the system –application: derived from a real-world application program

6 Purpose of Benchmarking To define the playing field To provide a tool enabling quantitative comparisons Acceleration of progress –enable better engineering by defining measurable and repeatable objectives Establishing a performance agenda –measure release-to-release or version-to-version progress –set goals to meet –be understandable and useful also to people without expertise in the field (managers, etc.)

7 Properties of a Good Benchmark Relevance: meaningful within the target domain Understandability Good metric(s): linear, orthogonal, monotonic Scalability: applicable to a broad spectrum of hardware/architectures Coverage: does not over-constrain the typical environment Acceptance: embraced by users and vendors Has to enable comparative evaluation Limited lifetime: there is a point when additional code modifications or optimizations become counterproductive Adapted from: “Standard Benchmarks for Database Systems” by Charles Levine, SIGMOD ’97

8 Definitions, properties and applications Early benchmarks Linpack Other parallel benchmarks Organized benchmarking Presentation and interpretation of results Summary

9 Early Benchmarks Whetstone –Floating point intensive Dhrystone –Integer and character string oriented Livermore Fortran Kernels –“Livermore Loops” –Collection of short kernels NAS kernel –7 Fortran test kernels for aerospace computation The sources of the benchmarks listed above are available from:

10 Whetstone Originally written in Algol 60 in 1972 at the National Physical Laboratory (UK) Named after the Whetstone Algol translator-interpreter on the KDF9 computer Measures primarily floating point performance in WIPS: Whetstone Instructions Per Second Also raised the issue of the efficiency of different programming languages The original Algol code was translated to C and Fortran (single and double precision support), PL/I, APL, Pascal, Basic, Simula and others

11 Dhrystone Synthetic benchmark developed in 1984 by Reinhold Weicker The name is a pun on “Whetstone” Measures integer and string operations performance, expressed in number of iterations, or Dhrystones, per second Alternative unit: D-MIPS, normalized to VAX 11/780 performance Latest version released: 2.1, includes implementations in C, Ada and Pascal Superseded by SPECint suite Gordon Bell and VAX 11/780

12 Livermore Fortran Kernels (LFK) Developed at Lawrence Livermore National Laboratory in 1970 –also known as Livermore Loops Consists of 24 separate kernels: –hydrodynamic codes, Cholesky conjugate gradient, linear algebra, equation of state, integration, predictors, first sum and difference, particle in cell, Monte Carlo, linear recurrence, discrete ordinate transport, Planckian distribution and others –include careful and careless coding practices Produces 72 timing results using 3 different DO-loop lengths for each kernel Produces Megaflops values for each kernel and range statistics of the results Can be used as performance, compiler accuracy (checksums stored in code) or hardware endurance test
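To give a flavor of these kernels, the first Livermore loop (the “hydro fragment”) is just a short vectorizable loop; the following C transcription is our sketch of it (the original kernels are in Fortran, and the timing harness, array sizes and checksums are omitted here).

/* Sketch of LFK kernel 1, the "hydro fragment", transcribed to C:
   x(k) = q + y(k)*(r*z(k+10) + t*z(k+11)).
   The caller must provide z with at least n+11 elements. */
void hydro_fragment(int n, double q, double r, double t,
                    double *x, const double *y, const double *z) {
    for (int k = 0; k < n; k++)
        x[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11]);
}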

13 NAS Kernel Developed at the Numerical Aerodynamic Simulation Projects Office at NASA Ames Focuses on vector floating point performance Consists of 7 test kernels in Fortran (approx lines of code): –matrix multiply –complex 2-D FFT –Cholesky decomposition –block tri-diagonal matrix solver –vortex method setup with Gaussian elimination –vortex creation with boundary conditions –parallel inverse of three matrix pentadiagonals Reports performance in Mflops (64-bit precision)

14 Definitions, properties and applications Early benchmarks Linpack Other parallel benchmarks Organized benchmarking Presentation and interpretation of results Summary

15 Linpack Overview Introduced by Jack Dongarra in 1979 Based on the LINPACK linear algebra package developed by J. Dongarra, J. Bunch, C. Moler and G. W. Stewart (now superseded by the LAPACK library) Solves a dense, regular system of linear equations, using matrices initialized with pseudo-random numbers Provides an estimate of a system’s effective floating-point performance Does not reflect the overall performance of the machine!

16 Linpack Benchmark Variants Linpack Fortran (single processor) –N=100 –N=1000, TPP, best effort Linpack’s Highly Parallel Computing benchmark (HPL) Java Linpack

17 Fortran Linpack (I) N=100 case Provides results listed in Table 1 of the “Linpack Benchmark Report” Absolutely no changes to the code can be made (not even in comments!) The matrix generated by the program must be used to run this case An external timing function (SECOND) has to be supplied (a sketch follows below) Only compiler-induced optimizations are allowed Measures performance of two routines –DGEFA: LU decomposition with partial pivoting –DGESL: solves a system of linear equations using the result from DGEFA Complexity: O(n²) for DGESL, O(n³) for DGEFA
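The external timing routine can be as simple as a wall-clock reader; below is a minimal sketch of a SECOND implementation in C, assuming the common Fortran-to-C convention of a trailing underscore in the external symbol name (this convention is compiler-dependent, and CPU-time-based variants are equally valid).

/* Hypothetical SECOND() for the Fortran Linpack driver, assuming the compiler
   appends a trailing underscore to external names. Returns elapsed wall-clock
   time in seconds as a single-precision value; the time is folded to the
   current day to preserve resolution in a 32-bit float. */
#include <sys/time.h>

float second_(void) {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return (float)(tv.tv_sec % 86400) + (float)tv.tv_usec * 1.0e-6f;
}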

18 Fortran Linpack (II) N=1000 case, Toward Peak Performance (TPP), Best Effort Provides results listed in Table 1 of the “Linpack Benchmark Report” The user can choose any method of solving the linear equations Allows a complete replacement of the factorization/solver code by the user No restriction on the implementation language for the solver The solution must conform to prescribed accuracy and the matrix used must be the same as the matrix used by the netlib driver

19 Linpack Fortran Performance on Different Platforms
[Table: columns are N=100 [MFlops], N=1000 TPP [MFlops], and Theoretical Peak [MFlops]; the numeric entries were not preserved in this transcript. Systems compared: Intel Pentium Woodcrest (1 core, 3 GHz); NEC SX-8/8 (8 proc. and 1 proc., 2 GHz); HP ProLiant BL20p G3 (4 cores and 1 core, 3.8 GHz Intel Xeon); IBM eServer p5-575 (8 and 1 POWER5 proc., 1.9 GHz); SGI Altix 3700 Bx2 (1 Itanium2 proc., 1.6 GHz); HP ProLiant BL45p (4 cores and 1 core, AMD Opteron 854, 2.8 GHz); Fujitsu VPP5000/1 (1 proc., 3.33 ns); Cray T932 (32 proc. and 1 proc., 2.2 ns); HP AlphaServer GS1280 7/1300 (8 and 1 Alpha proc., 1.3 GHz); HP 9000 rp (8 and 1 PA-8800 proc., 1000 MHz).]
Data excerpted from the LINPACK Benchmark Report.

20 Fortran Linpack Demo
> ./linpack
Please send the results of this run to: Jack J. Dongarra, Computer Science Department, University of Tennessee, Knoxville, Tennessee
[Output (numeric values not preserved in this transcript): norm. resid, resid, machep, x(1), x(n); “times are reported for matrices of order 100”; columns dgefa, dgesl, total, mflops, unit, ratio, b(1); the timing block is repeated for two array leading dimensions; “end of tests -- this version dated 05/29/04”]
Annotations on the output columns:
–dgefa: time spent in the matrix factorization routine
–dgesl: time spent in the solver
–total: total time (dgefa + dgesl)
–mflops: sustained floating point rate (see the sketch below)
–unit: “timing” unit (obsolete)
–ratio: fraction of Cray-1S execution time (obsolete)
–b(1): first element of the right-hand-side vector
Two different leading dimensions are used to test the effect of array placement in memory.
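The mflops column is derived from the nominal operation count of the factorization and solve rather than from counting hardware instructions; the sketch below shows that conversion, using the 2/3·n³ + 2·n² count employed by the standard netlib driver (the time value is a made-up example).

/* Sketch: how the Linpack driver converts measured time to Mflops.
   ops = 2/3*n^3 (factorization) + 2*n^2 (solve). */
#include <stdio.h>

int main(void) {
    int    n     = 100;
    double total = 0.00095;   /* dgefa + dgesl time in seconds (example value only) */
    double ops   = (2.0 * n * n * n) / 3.0 + 2.0 * (double)n * n;
    printf("mflops = %.2f\n", ops / total / 1.0e6);
    return 0;
}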

21 Linpack’s Highly Parallel Computing Benchmark (HPL) Measures the performance of distributed memory machines Used in the “Linpack Benchmark Report” (Table 3) and to determine the order of machines on the Top500 list A portable implementation written in C External dependencies: –MPI-1.1 functionality for inter-node communication –BLAS or VSIPL library for simple vector operations such as scaled vector addition (DAXPY: y = αx + y) and inner dot product (DDOT: a = Σ xᵢyᵢ); see the sketch after this list Ground rules: –allows a complete user replacement of the LU factorization and solver steps (the accuracy must satisfy a given bound) –same matrix as in the driver program –no restrictions on problem size
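For reference, the two Level-1 BLAS operations named above can be sketched in a few lines of C (simplified: the real BLAS routines also take stride arguments, and HPL would link against an optimized, vendor-tuned implementation rather than loops like these).

/* Reference-style sketches of the two Level-1 BLAS operations mentioned above. */

/* DAXPY: y = alpha*x + y */
void daxpy(int n, double alpha, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* DDOT: a = sum_i x[i]*y[i] */
double ddot(int n, const double *x, const double *y) {
    double a = 0.0;
    for (int i = 0; i < n; i++)
        a += x[i] * y[i];
    return a;
}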

22 HPL Algorithm Data distribution: 2-D block-cyclic (see the sketch below) Algorithm elements: –right-looking variant of LU factorization with row partial pivoting featuring multiple look-ahead depths –recursive panel factorization with pivot search and column broadcast combined –various virtual panel broadcast topologies –bandwidth reducing swap-broadcast algorithm –backward substitution with look-ahead depth of one Floating point operation count: 2/3·n³ + n²
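The 2-D block-cyclic distribution reduces to a simple ownership rule: global element (i, j) lives in block (i/NB, j/NB), and that block is assigned to process (block row mod P, block column mod Q) on the P×Q process grid. A small C sketch of the rule (the names and struct are ours, not HPL's):

/* Sketch of 2-D block-cyclic ownership over a P x Q process grid.
   Illustrative helper, not taken from the HPL sources. */
typedef struct { int p_row, p_col; } owner_t;

owner_t owner_of(int i, int j, int nb, int P, int Q) {
    owner_t o;
    o.p_row = (i / nb) % P;   /* process row owning global row block  */
    o.p_col = (j / nb) % Q;   /* process column owning global column block */
    return o;
}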

23 HPL Algorithm Elements
[Figure: execution flow for a single parameter set: Matrix Generation, then a loop of Panel Factorization, Panel Broadcast, Look-ahead and Update until all columns of A are processed, followed by Backward Substitution and Solution Check. Also shown: the matrix distribution scheme over a P×Q grid of processes and the six available panel broadcast algorithms. The right-looking variant of LU factorization is used; in each iteration of the loop a panel of NB columns is factorized and the trailing submatrix is updated.]
Reference:

24 HPL Linpack Metrics The HPL implementation of the benchmark is run for different problem sizes N on the entire machine For a certain problem size Nmax, the cumulative performance in Mflops (reflecting 64-bit addition and multiplication operations) reaches its maximum value, denoted Rmax Another metric obtainable from the benchmark is N1/2, the problem size for which half of the maximum performance (Rmax/2) is achieved The Rmax value is used to rank supercomputers in the Top500 list; listed along with this number are the theoretical peak double precision floating point performance Rpeak of the machine and N1/2 (a back-of-the-envelope sketch follows below)
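For orientation, Rpeak is simply the product of processor count, clock rate and floating-point operations per cycle, while the largest usable N is bounded by memory (an N×N double-precision matrix needs 8·N² bytes). A back-of-the-envelope C sketch with hypothetical machine parameters:

/* Back-of-the-envelope sketch (all machine parameters below are assumptions):
   theoretical peak and a memory-limited upper bound on the HPL problem size. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double procs        = 1024;          /* number of processors (assumed)           */
    double ghz          = 2.0;           /* clock rate in GHz (assumed)              */
    double flops_per_ck = 4;             /* FP ops per cycle per processor (assumed) */
    double mem_bytes    = 1024 * 2.0e9;  /* total memory, assuming 2 GB per proc     */

    double rpeak_gflops = procs * ghz * flops_per_ck;
    /* An N x N double-precision matrix occupies 8*N*N bytes; leave headroom
       (here 80% of memory) for the OS, communication buffers, etc. */
    double n_max = floor(sqrt(0.8 * mem_bytes / 8.0));

    printf("Rpeak ~ %.0f Gflops, memory-limited N <= %.0f\n", rpeak_gflops, n_max);
    return 0;
}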

25 Machine Parameters Influencing Linpack Performance
Parameter | Linpack Fortran, N=100 | Linpack Fortran, N=1000, TPP | HPL
Processor speed | Yes | Yes | Yes
Memory capacity | No | No (on modern systems) | Yes (for Rmax)
Network latency/bandwidth | No | No | Yes
Compiler flags | Yes | Yes | Yes

26 Ten Fastest Supercomputers On Current Top500 List
# | Computer | Site | Processors | Rmax [GFlops] | Rpeak [GFlops]
1 | IBM Blue Gene/L | DoE/NNSA/LLNL (USA) | 131,072 | 280,600 | 367,000
2 | Cray Red Storm | Sandia (USA) | 26,544 | 101,400 | 127,411
3 | IBM BGW | IBM T. Watson Research Center (USA) | 40,960 | 91,290 | 114,688
4 | IBM ASC Purple | DoE/NNSA/LLNL (USA) | 12,208 | 75,760 | 92,781
5 | IBM Mare Nostrum | Barcelona Supercomputing Center (Spain) | 10,240 | 62,630 | 94,208
6 | Dell Thunderbird | NNSA/Sandia (USA) | 9,024 | 53,000 | 64,973
7 | Bull Tera-10 | Commissariat a l’Energie Atomique (France) | 9,968 | 52,840 | 63,795
8 | SGI Columbia | NASA/Ames Research Center (USA) | 10,160 | 51,870 | 60,960
9 | NEC/Sun Tsubame | GSIC Center, Tokyo Institute of Technology (Japan) | 11,088 | 47,380 | 82,125
10 | Cray Jaguar | Oak Ridge National Laboratory (USA) | 10,424 | 43,480 | 54,205
Source:

27 Java Linpack Intended mostly to measure the efficiency of the Java implementation rather than hardware floating point performance Solves a dense 500x500 system of linear equations with one right-hand side, Ax=b Matrix A is generated randomly Vector b is constructed so that all components of the solution x are one (see the sketch below) Uses Gaussian elimination with partial pivoting Reports: Mflops, time to solution, Norm Res (solution accuracy), relative machine precision
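Constructing b so that the exact solution is a vector of ones simply means summing the rows of A; a minimal sketch of that setup (written in C rather than Java, with our own names):

/* Sketch: generate a random matrix A and a right-hand side b = A * ones,
   so the exact solution of Ax = b is x = (1, 1, ..., 1). */
#include <stdlib.h>

void setup(int n, double *A, double *b) {       /* A stored row-major, size n*n */
    for (int i = 0; i < n; i++) {
        b[i] = 0.0;
        for (int j = 0; j < n; j++) {
            A[i*n + j] = (double)rand() / RAND_MAX - 0.5;
            b[i] += A[i*n + j];
        }
    }
}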

28 HPL Demo
> mpirun -np 4 xhpl
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant. N : The order of the coefficient matrix A. NB : The partitioning blocking factor. P : The number of process rows. Q : The number of process columns. Time : Time in seconds to solve the linear system. Gflops : Rate of execution for solving the linear system.
The following parameter values will be used: N : 5000, NB : 32, PMAP : Row-major process mapping, P, Q, PFACT : Left, NBMIN : 2, NDIV : 2, RFACT : Left, BCAST : 1ringM, DEPTH : 0, SWAP : Mix (threshold = 64), L1 : transposed form, U : transposed form, EQUIL : yes, ALIGN : 8 double precision words
The matrix A is randomly generated for each test. The following scaled residual checks will be computed:
1) ||Ax-b||_oo / ( eps * ||A||_1 * N )
2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
The relative machine precision (eps) is taken to be ...e-16; computational tests pass if scaled residuals are less than 16.0. For configuration issues, consult:
[Results of the three runs are only partially preserved in this transcript; each reports T/V (WR01L2L...), N, NB, P, Q, Time and Gflops, and all three scaled residual checks are PASSED.]
Finished 3 tests with the following results: 3 tests completed and passed residual checks, 0 tests completed and failed residual checks, 0 tests skipped because of illegal input values. End of Tests.

29 Definitions, properties and applications Early benchmarks Linpack Other parallel benchmarks Organized benchmarking Presentation and interpretation of results Summary

30 Other Parallel Benchmarks High Performance Computing Challenge (HPCC) benchmarks –Devised and sponsored to enrich the benchmarking parameter set NAS Parallel Benchmarks (NPB) –Powerful set of metrics –Reflects computational fluid dynamics NPBIO-MPI –Stresses external I/O system

31 HPC Challenge Benchmark Consists of 7 individual tests: –HPL (Linpack TPP): floating point rate of execution of a solver of a linear system of equations –DGEMM: floating point rate of execution of double precision matrix-matrix multiplication –STREAM: sustainable memory bandwidth (GB/s) and the corresponding computation rate for simple vector kernels (see the sketch after this list) –PTRANS (parallel matrix transpose): total capacity of the network using pairwise communicating processes –RandomAccess: the rate of integer random updates of memory (in GUPS: Giga-Updates Per Second) –FFT: floating point rate of execution of double precision complex 1-D Discrete Fourier Transform –b_eff (effective bandwidth benchmark): latency and bandwidth of a number of simultaneous communication patterns
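To make the STREAM entry concrete, its kernels are short vector loops; the sketch below shows the idea of the "triad" kernel and the bandwidth figure it implies (this is an illustration, not the official STREAM source, and the array size is an arbitrary choice that should comfortably exceed the caches).

/* Sketch of the STREAM "triad" kernel: a[i] = b[i] + scalar*c[i].
   Each iteration moves 24 bytes (two 8-byte loads, one 8-byte store),
   so sustained bandwidth ~ 24*N / time bytes per second. */
#include <stdio.h>
#include <time.h>

#define N 20000000   /* large enough to exceed caches; adjust to available memory */

static double a[N], b[N], c[N];

int main(void) {
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    double scalar = 3.0;
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("triad: %.2f GB/s (a[0]=%g)\n", 24.0 * N / sec / 1e9, a[0]);
    return 0;
}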

32 Comparison of HPCC Results on Selected Supercomputers [Comparison chart not reproduced.] Notes: all metrics shown are “higher is better”, except for the Random Ring Latency; machine labels include the machine name (optional), manufacturer and system name, affiliation and (in parentheses) processor/network fabric type

33 NAS Parallel Benchmarks Derived from computational fluid dynamics (CFD) applications Consist of five kernels and three pseudo-applications Exist in several flavors: –NPB 1: original paper-and-pencil specification; generally proprietary implementations by hardware vendors –NPB 2: MPI-based sources distributed by NAS; supplements NPB 1; can be run with little or no tuning –NPB 3: implementations in OpenMP, HPF and Java, derived from the NPB-serial version with improved serial code; a set of multi-zone benchmarks was added to test the implementation efficiency of multi-level and hybrid parallelization methods and tools (e.g. OpenMP with MPI) –GridNPB 3: new suite of benchmarks designed to rate the performance of computational grids; includes only four benchmarks, derived from the original NPB, written in Fortran and Java, with Globus as grid middleware

34 NPB 2 Overview Multiple problem classes (S, W, A, B, C, D) Tests written mainly in Fortran (IS in C): –BT (block tri-diagonal solver with 5x5 block size) –CG (conjugate gradient approximation to compute the smallest eigenvalue of a sparse, symmetric positive definite matrix) –EP (“embarrassingly parallel”; evaluates an integral by means of pseudorandom trials; see the sketch below) –FT (3-D PDE solver using Fast Fourier Transforms) –IS (large integer sort; tests both integer computation speed and network performance) –LU (a regular-sparse, 5x5 block lower and upper triangular system solver) –MG (simplified multigrid kernel; tests both short and long distance data communication) –SP (solves multiple independent systems of non-diagonally dominant, scalar, pentadiagonal equations) Sources and reports available from:
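As one concrete example, the "pseudorandom trials" of EP amount to drawing uniform pairs and transforming the accepted ones into Gaussian deviates that are tallied into annuli; the following is a simplified sketch of that idea, not the official NPB source (which prescribes a specific linear congruential generator, per-class trial counts and verification sums).

/* Simplified sketch of the EP ("embarrassingly parallel") idea:
   draw uniform pairs in (-1,1)^2, accept those inside the unit circle,
   and turn them into Gaussian deviates via the polar Box-Muller transform,
   tallying the deviates into ten square annuli. Illustrative only. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    long trials = 1L << 20;                  /* class-dependent in the real benchmark */
    long counts[10] = {0};
    for (long k = 0; k < trials; k++) {
        double x = 2.0 * rand() / RAND_MAX - 1.0;
        double y = 2.0 * rand() / RAND_MAX - 1.0;
        double t = x * x + y * y;
        if (t > 1.0 || t == 0.0) continue;             /* rejected pair */
        double f  = sqrt(-2.0 * log(t) / t);
        double gx = fabs(x * f), gy = fabs(y * f);     /* two Gaussian deviates */
        int bin = (int)fmax(gx, gy);                   /* square annulus index  */
        if (bin < 10) counts[bin]++;
    }
    for (int i = 0; i < 10; i++) printf("annulus %d: %ld\n", i, counts[i]);
    return 0;
}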

35 NPBIO-MPI Attempts to address the lack of I/O tests in NPB, focusing primarily on file output Based on the BTIO effort, which extended the BT benchmark with routines writing five double precision numbers per mesh point to storage –runs for 200 iterations, writing every five iterations –after all time steps are finished, all data belonging to a single time step must be stored in the same file, sorted by vector components –timing must include all required data rearrangements to achieve the specified data layout Supported access scenarios: –simple: MPI-IO without collective buffering –full: MPI-IO with collective buffering (see the sketch below) –fortran: Fortran 77 file operations –epio: each process writes its part of the computational domain to a separate file Number of processes must be a square Problem sizes: class A (64³), class B (102³), class C (162³) Several possible results, depending on the benchmarking goal: effective flops, effective output bandwidth or output overhead
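The "full" scenario relies on MPI-IO collective writes; below is a minimal sketch of such a write (illustrative only: the file name, chunk size and data layout are ours, not those of the BTIO specification).

/* Minimal sketch of an MPI-IO collective write, the idea behind the "full"
   scenario: every process writes its contiguous chunk of doubles, and the
   collective call lets the MPI-IO layer aggregate the I/O. Illustrative only. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { CHUNK = 1024 };
    double buf[CHUNK];
    for (int i = 0; i < CHUNK; i++) buf[i] = rank + i * 1e-3;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "btio_sketch.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its chunk at a rank-dependent offset; the _all variant
       makes the operation collective across all ranks. */
    MPI_Offset offset = (MPI_Offset)rank * CHUNK * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, CHUNK, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}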

36 Sample NPB 2 Results [Results chart not reproduced.] Reference: “The NAS Parallel Benchmarks 2.1 Results” by W. Saphir, A. Woo, and M. Yarrow

37 Definitions, properties and applications Early benchmarks Linpack Other parallel benchmarks Organized benchmarking Presentation and interpretation of results Summary

38 Benchmarking Organizations SPEC –Created to satisfy the need for realistic, fair and standardized performance tests –Motto: “An ounce of honest data is worth more than a pound of marketing hype” TPC –Formed primarily due to lack of reliable database benchmarks

39 SPEC Benchmark Suite Overview The Standard Performance Evaluation Corporation is a non-profit organization (financed by its members: over 60 leading computer and software manufacturers) founded in 1988 SPEC benchmarks are written in a platform-neutral language (typically C or Fortran) The code may be compiled using arbitrary compilers, but the sources may not be modified –many manufacturers are known to optimize their compilers and/or systems to improve the SPEC results Benchmarks may be obtained by purchasing a license from SPEC; the results are published on the SPEC website Website:

40 SPEC Suite Components SPEC CPU2006: combined performance of CPU, memory and compiler –CINT2006 (aka. SPECint): integer arithmetic test using compilers, interpreters, word processors, chess programs, etc. –CFP2006 (aka. SPECfp): floating point test using physical simulations, 3D graphics, image processing, computational chemistry, etc. SPECweb2005: PHP/JSP performance SPECviewperf: OpenGL 3D graphic system performance SPECapc: several popular 3D-intensive applications SPEC HPC2002: high-end parallel computing tests using quantum chemistry application, weather modeling, industrial oil deposits locator SPEC OMP2001: OpenMP application performance SPECjvm98: performance of java client on a Java VM SPECjAppServer2004: multi-tier benchmark measuring the performance of J2EE application servers SPECjbb2005: server-side Java performance SPEC MAIL2001: mail server performance (SMTP and POP) SPEC SFS97_R1: NFS server throughput and response time Planned: SPEC MPI2006, SPECimap, SPECpower, Virtualization

41 Sample Results: SPEC CPU2006
[Table: columns are CINT2006 Speed, CFP2006 Speed, CINT2006 Rate and CFP2006 Rate, each with base and peak values; the numeric entries were not preserved in this transcript. Systems compared: Dell Precision 380 (Pentium EE, 2 cores); HP ProLiant DL380 G4 (Xeon 3.8 GHz, 2 cores); HP ProLiant DL585 (Opteron, 2 cores); Sun Blade 2500 (1 UltraSPARC IIIi, 1280 MHz); Sun Fire E25K (UltraSPARC IV+ 1500 MHz, 144 cores); HP Integrity rx6600 (Itanium2 1.6 GHz/24 MB, 2 cores and 8 cores); HP Integrity Superdome (Itanium2 1.6 GHz/24 MB, 128 cores).]
Notes: the base metric requires that the same flags are used when compiling all instances of the benchmark (peak is less strict); the speed metric measures how fast a computer executes a single task, while rate determines throughput with multiple tasks

42 TPC Governed by the Transaction Processing Performance Council, founded in 1985 –members include leading system and microprocessor manufacturers, and commercial database developers –the council appoints professional affiliates and auditors outside the member group to help fulfill the TPC’s mission and validate benchmark results Current benchmark flavors: –TPC-C for transaction processing (de-facto standard for On-Line Transaction Processing) –TPC-H for decision support systems –TPC-App for web services Obsolete benchmarks: –TPC-A (performance of update-intensive databases) –TPC-B (throughput of a system in transactions per second) –TPC-D (decision support applications with long running queries against complex data structures) –TPC-R (business reporting, decision support) –TPC-W (transactional web e-Commerce benchmark)

43 Top Ten TPC-C Results
By performance (tpmC with price per tpmC; numeric values only partially preserved in this transcript): IBM p5 595 (IBM DB2 9); IBM eServer p5 595 (IBM DB2 UDB 8.2); IBM eServer p5 595 (Oracle 10g EE); Fujitsu PRIMEQUEST 540 (Oracle 10g EE); HP Integrity Superdome (MS SQL Server 2005 EE SP1); HP Integrity rx5670 (Oracle 10g EE); IBM eServer pSeries 690 (IBM DB2 UDB 8.1); IBM p5 570 (IBM DB2 UDB 8.2); HP Integrity Superdome (Oracle 10g EE); IBM eServer p5 570 (IBM DB2 UDB 8.1).
By price/performance: Dell PowerEdge 2900 (MS SQL Server 2005); Dell PowerEdge 2800/2.8GHz (MS SQL Server 2005 x64); Dell PowerEdge 2800/3.6GHz (MS SQL Server 2005 WE); Dell PowerEdge 2800/3.4GHz (MS SQL Server 2000 WE); Dell PowerEdge 2850/3.4GHz (MS SQL Server 2000); HP ProLiant ML350-T03 (MS SQL Server 2000 SP3); HP ProLiant ML350-T03/3.06 (IBM DB2 UDB 8.1 Express); HP ProLiant ML350-T03/2.8 (IBM DB2 UDB 8.1 Express); HP ProLiant ML370-G4-1M/3.6 (MS SQL Server 2000 EE SP3); HP Integrity rx2600 (Oracle 10g).

44 Definitions, properties and applications Early benchmarks Linpack Other parallel benchmarks Organized benchmarking Presentation and interpretation of results Summary

Presentation of the Results Tables Graphs –Bar graphs (a) –Scatter plots (b) –Line plots (c) –Pie charts (d) –Gantt charts (e) –Kiviat graphs (f) Enhancements –Error bars, boxes or confidence intervals –Broken or offset scales (be careful!) –Multiple curves per graph (but avoid overloading) –Data labels, colors, etc. [Example figures (a)–(f) not reproduced.]

46 Kiviat Graph Example [Example figure not reproduced.] Source:

47 Mixed Graph Example [Figure: characterization of NSF/CCT parallel applications (WRF, OOCORE, MILC, PARATEC, HOMME, BSSN_PUGH, Whisky_Carpet, ADCIRC, PETSc_FUN3D) on the POWER5 architecture, using data collected by IPM; bars show the computation vs. communication fractions and the breakdown into floating point, load/store and other operations.]

48 Graph Do’s and Don’ts Good graphs: –Require minimum effort from the reader –Maximize information –Maximize the information-to-ink ratio –Use commonly accepted practices –Avoid ambiguity Poor graphs: –Have too many alternatives on a single chart –Display too many y-variables on a single chart –Use vague symbols in place of text –Show extraneous information –Select scale ranges improperly –Use a line chart where a bar graph is appropriate Reference: Raj Jain, The Art of Computer Systems Performance Analysis, Chapter 10

49 Common Mistakes in Benchmarking (from Chapter 9 of The Art of Computer Systems Performance Analysis by Raj Jain): Only average behavior represented in test workload Skewness of device demands ignored Loading level controlled inappropriately Caching effects ignored Buffering sizes not appropriate Inaccuracies due to sampling ignored Ignoring monitoring overhead Not validating measurements Not ensuring same initial conditions Not measuring transient performance Using device utilizations for performance comparisons Collecting too much data but doing very little analysis

50 Misrepresentation of Performance Results on Parallel Computers Quote only 32-bit performance results, not 64-bit results Present performance for an inner kernel, representing it as the performance of the entire application Quietly employ assembly code and other low-level constructs Scale problem size with the number of processors, but omit any mention of this fact Quote performance results projected to the full system Compare your results with scalar, unoptimized code run on another platform When direct run time comparisons are required, compare with an old code on an obsolete system If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar Mutilate the algorithm used in the parallel implementation to match the architecture Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment If all else fails, show pretty pictures and animated videos, and don't talk about performance Reference: David Bailey “Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers”, Supercomputing Review, Aug 1991, pp.54-55,

51 Definitions, properties and applications Early benchmarks Linpack Other parallel benchmarks Organized benchmarking Presentation and interpretation of results Summary

52 Knowledge Factors & Skills Knowledge factors: –benchmarking and metrics –performance factors –Top500 list Skill set: –determine state of system resources and manipulate them –acquire, run and measure benchmark performance –launch user application codes

53 Material For Test Basic performance metrics (slide 4) Definition of benchmark in own words; purpose of benchmarking; properties of good benchmark (slides 5, 6, 7) Linpack: what it is, what it measures, concepts and complexities (slides 15, 17, 18) HPL (slides 21 and 24) Linpack compare and contrast (slide 25) General knowledge about HPCC and NPB suites (slides 31 and 34) Benchmark result interpretation (slides 49, 50)