1
Benchmarks for Parallel Systems

Sources/Credits:
- "Performance of Various Computers Using Standard Linear Equations Software", Jack Dongarra, University of Tennessee, Knoxville TN 37996, Computer Science Technical Report CS-89-85, April 8, 2004. http://www.netlib.org/benchmark/performance.ps
- Top500 (courtesy: Jack Dongarra): http://www.top500.org
- LINPACK FAQ: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
- "The LINPACK Benchmark: Past, Present, and Future", Jack Dongarra, Piotr Luszczek, and Antoine Petitet
- NAS Parallel Benchmarks: http://www.nas.nasa.gov/Software/NPB/
2
LINPACK (Dongarra, 1979)
Dense system of linear equations
Initially appeared as part of the user's guide for the LINPACK package
LINPACK (1979): the N=100 benchmark, the N=1000 benchmark, and the Highly Parallel Computing benchmark
3
LINPACK benchmark
Implemented on top of BLAS1
2 main operations – DGEFA (Gaussian elimination, O(n³)) and DGESL (solving Ax = b with the factors, O(n²))
Major operation (97%) – DAXPY: y = y + α·x
Executed n³/3 + n² times, hence approximately 2n³/3 + 2n² flops
64-bit floating point arithmetic
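To make the dominant operation concrete, here is a minimal C sketch of DAXPY; the benchmark itself calls the tuned BLAS1 routine, so this loop is illustrative only:

```c
/* DAXPY: y = y + alpha*x, the operation accounting for ~97% of the
 * LINPACK benchmark's work. Each element costs one multiply and one
 * add, which is where the factor 2 in 2n^3/3 + 2n^2 flops comes from. */
void daxpy(int n, double alpha, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```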
4
LINPACK
N=100: a 100x100 system of equations. No change to the code is allowed; the user supplies only a timing routine called SECOND, and no compiler optimizations are permitted.
N=1000: a 1000x1000 system. The user may implement any code, provided it delivers the required accuracy: Towards Peak Performance (TPP). The driver program always credits 2n³/3 + 2n² operations.
"Highly Parallel Computing" benchmark: any software may be used and the matrix size can be chosen. Used in the Top500.
All based on 64-bit floating point arithmetic
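Because the driver always credits the fixed count 2n³/3 + 2n² whatever algorithm is actually run, the reported rate is simply that count divided by the measured time. A small sketch (the function name and the 0.5 s timing are made up for illustration):

```c
#include <stdio.h>

/* Rate credited by the LINPACK driver: fixed operation count / time. */
double linpack_mflops(double n, double seconds)
{
    double ops = 2.0 * n * n * n / 3.0 + 2.0 * n * n;
    return ops / seconds / 1.0e6;
}

int main(void)
{
    /* e.g. a hypothetical 1000x1000 solve finishing in 0.5 seconds */
    printf("%.1f Mflop/s\n", linpack_mflops(1000.0, 0.5));
    return 0;
}
```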
5
LINPACK
100x100 – inner-loop optimization
1000x1000 – three-loop / whole-program optimization
Scalable parallel program – largest problem that can fit in memory
6
HPL (High Performance LINPACK)
7
HPL Algorithm
2-D block-cyclic data distribution
Right-looking LU factorization
Panel factorization: various options
- Crout, left-looking, or right-looking recursive variants based on matrix multiply
- number of sub-panels
- recursive stopping criteria
- pivot search and broadcast by binary exchange
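A 2-D block-cyclic distribution assigns block (I, J) of the NB×NB-blocked matrix to process (I mod P, J mod Q) on a P×Q process grid. A sketch of that mapping (the names are illustrative, not HPL's internal routines):

```c
/* Map a global element (gi, gj) of a matrix blocked into nb x nb
 * tiles to its owning process on a P x Q grid. */
typedef struct { int prow, pcol; } Owner;

Owner owner_of(int gi, int gj, int nb, int P, int Q)
{
    Owner o;
    o.prow = (gi / nb) % P;   /* block rows cycle over process rows    */
    o.pcol = (gj / nb) % Q;   /* block columns cycle over process cols */
    return o;
}
```

Cycling blocks over the grid keeps every process busy as the factorization shrinks the trailing matrix, which is why HPL uses this layout rather than a simple block partition.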
8
HPL algorithm
Panel broadcast: various options
Update of trailing matrix: look-ahead pipeline
Validity check: a scaled residual that should be O(1)
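One common form of the check is the scaled residual: with r = b − Ax, the quantity ‖r‖∞ / (ε · (‖A‖∞‖x‖∞ + ‖b‖∞) · n) should be O(1) for a numerically correct solve. A dense, single-process sketch (row-major storage assumed; HPL's own check runs on the distributed matrix):

```c
#include <math.h>
#include <float.h>

/* Scaled residual of a computed solution x for Ax = b; a result of
 * order 1 indicates a numerically valid solve. A is row-major, n x n. */
double scaled_residual(int n, const double *A,
                       const double *x, const double *b)
{
    double rnorm = 0.0, anorm = 0.0, xnorm = 0.0, bnorm = 0.0;
    for (int i = 0; i < n; i++) {
        double ax = 0.0, rowsum = 0.0;
        for (int j = 0; j < n; j++) {
            ax     += A[i * n + j] * x[j];
            rowsum += fabs(A[i * n + j]);
        }
        if (fabs(b[i] - ax) > rnorm) rnorm = fabs(b[i] - ax); /* ||r||  */
        if (rowsum > anorm)          anorm = rowsum;          /* ||A||  */
        if (fabs(b[i]) > bnorm)      bnorm = fabs(b[i]);      /* ||b||  */
    }
    for (int j = 0; j < n; j++)
        if (fabs(x[j]) > xnorm) xnorm = fabs(x[j]);           /* ||x||  */
    return rnorm / (DBL_EPSILON * (anorm * xnorm + bnorm) * n);
}
```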
9
Top500 (www.top500.org)
Started in 1993
Published twice a year – June and November
Reports Nmax, Rmax, N1/2, and Rpeak for each system
10
TOP500 list – Data shown

Manufacturer – manufacturer or vendor
Computer – type indicated by manufacturer or vendor
Installation Site – customer
Location – location and country
Year – year of installation / last major update
Installation Type – Academic, Research, Industry, Vendor, Classified, Government
Installation Area – e.g. Research: Energy / Industry: Finance
# Processors – number of processors
Rmax – maximal LINPACK performance achieved
Rpeak – theoretical peak performance
Nmax – problem size for achieving Rmax
N1/2 – problem size for achieving half of Rmax
Nworld – position within the TOP500 ranking
11
24th List: The TOP 5 (Rmax and Rpeak in Gflop/s)

1. BlueGene/L DD2 beta-System (0.7 GHz PowerPC 440), IBM – IBM/DOE, United States/2004 – 32768 processors – Rmax 70720, Rpeak 91750
2. Columbia: SGI Altix 1.5 GHz, Voltaire Infiniband, SGI – NASA/Ames Research Center/NAS, United States/2004 – 10160 processors – Rmax 51870, Rpeak 60960
3. Earth-Simulator, NEC – The Earth Simulator Center, Japan/2002 – 5120 processors – Rmax 35860, Rpeak 40960
4. MareNostrum: eServer BladeCenter JS20 (PowerPC970 2.2 GHz), Myrinet, IBM – Barcelona Supercomputer Center, Spain/2004 – 3564 processors – Rmax 20530, Rpeak 31363
5. Thunder: Intel Itanium2 Tiger4 1.4 GHz, Quadrics, California Digital Corporation – Lawrence Livermore National Laboratory, United States/2004 – 4096 processors – Rmax 19940, Rpeak 22938
12
24th List: India (Rmax and Rpeak in Gflop/s)

267. Integrity Superdome, 1.5 GHz, HPlex, HP – Tech Pacific Exports C, India/2004 – 288 processors – Rmax 1210, Rpeak 1728
289. xSeries Cluster Xeon 2.4 GHz, Gig-E, IBM – Semiconductor Company (F), India/2003 – 574 processors – Rmax 1196.41, Rpeak 2755.2
435. xSeries Xeon 3.06 GHz, Gig-E, IBM – Geoscience (C), India/2004 – 256 processors – Rmax 961.28, Rpeak 1566.72
438. KABRU: Pentium Xeon Cluster 2.4 GHz, SCI 3D, IMSc-Netweb-Summation – Institute of Mathematical Sciences, C.I.T Campus, India/2004 – 288 processors – Rmax 959, Rpeak 1382.4
445-448. BladeCenter Xeon 3.06 GHz, Gig-Ethernet, IBM – Geoscience (B), India/2004 – 252 processors each – Rmax 946.26, Rpeak 1542.24 (four identical systems)
15
Manufacturer
16
Architecture
17
Processor Generation
18
System Processor Count
19
NAS Parallel Benchmarks (NPB)
Also used for evaluation of supercomputers
A set of 8 programs derived from CFD: 5 kernels and 3 pseudo-applications
NPB 1 – original benchmarks
NPB 2 – NAS's MPI implementation; NPB 2.4 Class D has more work and more I/O
NPB 3 – versions based on OpenMP, HPF, and Java
GridNPB3 – for computational grids
NPB 3 multi-zone – for hybrid parallelism
20
NPB 1.0 (March 1994)
Defines Class A and Class B versions
"Paper and pencil" algorithmic specifications
Generic benchmarks, in contrast to the MPI-based LINPACK
General rules for implementations – Fortran 90 or C, 64-bit arithmetic, etc.
Sample implementations provided
21
Kernel Benchmarks
EP – embarrassingly parallel (structure sketched below)
MG – multigrid; regular communication
CG – conjugate gradient; irregular long-distance communication
FT – a 3-D PDE solved using FFTs; a rigorous test of long-distance communication
IS – large integer sort
Detailed rules are given regarding:
- a brief statement of the problem
- the algorithm to be used
- validation of results
- where to insert timing calls
- the method for generating random numbers
- submission of results
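A sketch of the idea behind EP: each task independently generates uniform pseudo-random pairs and transforms accepted pairs into Gaussian deviates, with no communication needed until the final tally. Here rand() stands in for NPB's specified linear congruential generator, so this is illustrative only:

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    long accepted = 0;
    double gsum = 0.0;
    for (long k = 0; k < 1000000; k++) {
        double x = 2.0 * rand() / (double)RAND_MAX - 1.0;  /* in [-1, 1] */
        double y = 2.0 * rand() / (double)RAND_MAX - 1.0;
        double t = x * x + y * y;
        if (t <= 1.0 && t > 0.0) {               /* acceptance test      */
            double f = sqrt(-2.0 * log(t) / t);  /* polar transform      */
            gsum += x * f + y * f;               /* two Gaussian deviates */
            accepted++;
        }
    }
    printf("accepted pairs: %ld, deviate sum: %f\n", accepted, gsum);
    return 0;
}
```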
22
Pseudo-applications / Synthetic CFDs
Benchmark 1 – perform a few iterations of the approximate factorization algorithm (SP)
Benchmark 2 – perform a few iterations of the diagonal form of the approximate factorization algorithm (BT)
Benchmark 3 – perform a few iterations of SSOR (LU)
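To give a flavor of the LU benchmark's method, here is a generic SSOR (symmetric successive over-relaxation) sweep on a dense system; the benchmark itself applies SSOR to the block-sparse system arising from the CFD discretization, so this is only a structural sketch (omega is the relaxation factor, 0 < omega < 2):

```c
/* One SSOR iteration for Ax = b: a forward relaxed Gauss-Seidel sweep
 * followed by a backward sweep, which makes the iteration symmetric.
 * A is row-major n x n with nonzero diagonal. */
void ssor_sweep(int n, const double *A, const double *b,
                double *x, double omega)
{
    for (int i = 0; i < n; i++) {            /* forward sweep  */
        double s = b[i];
        for (int j = 0; j < n; j++)
            if (j != i) s -= A[i * n + j] * x[j];
        x[i] += omega * (s / A[i * n + i] - x[i]);
    }
    for (int i = n - 1; i >= 0; i--) {       /* backward sweep */
        double s = b[i];
        for (int j = 0; j < n; j++)
            if (j != i) s -= A[i * n + j] * x[j];
        x[i] += omega * (s / A[i * n + i] - x[i]);
    }
}
```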
23
Class A and Class B Sample Code (sample-code figures not preserved in this transcript)
24
NPB 2.0 (1995)
MPI and Fortran 77 implementations
2 parallel kernels (MG, FT) and 3 simulated applications (LU, SP, BT)
Class C – bigger problem size
Benchmark rules classify submissions by the amount of source code changed: 0%, 5%, or >5%
25
NPB 2.2 (1996), 2.4 (2002), 2.4 I/O (Jan 2003)
EP and IS added
FT rewritten
NPB 2.4 – Class D and the rationale for Class D sizes
2.4 I/O – a new benchmark problem based on BT (BTIO) to test output capabilities
An MPI implementation of the same (using MPI-IO) – different options, e.g. with or without collective buffering
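A sketch of the collective-I/O style BTIO exercises: every rank writes its slab of the solution with a single collective call, letting the MPI-IO layer aggregate requests (collective buffering). The file name and layout here are illustrative, not the actual BTIO format:

```c
#include <mpi.h>

/* Each rank writes `count` doubles at a rank-dependent offset. The
 * _at_all variant is collective, enabling collective buffering;
 * MPI_File_write_at would be the independent (non-collective) form. */
void write_solution(MPI_Comm comm, const double *slab, int count)
{
    int rank;
    MPI_File fh;
    MPI_Comm_rank(comm, &rank);
    MPI_File_open(comm, "solution.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_write_at_all(fh, offset, slab, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```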