Benchmarks for Parallel Systems

Sources/Credits:
 “Performance of Various Computers Using Standard Linear Equations Software”, Jack Dongarra, University of Tennessee, Knoxville TN 37996, Computer Science Technical Report CS-89-85, April 8, 2004
 The LINPACK Benchmark FAQ
 Courtesy: Jack Dongarra (Top500)
 “The LINPACK Benchmark: Past, Present, and Future”, Jack Dongarra, Piotr Luszczek, and Antoine Petitet
 NAS Parallel Benchmarks

LINPACK (Dongarra, 1979)
 Solves a dense system of linear equations
 Originally introduced as an appendix to the LINPACK package users’ guide (1979)
 Three variants: the N=100 benchmark, the N=1000 benchmark, and the Highly Parallel Computing benchmark

LINPACK benchmark
 Implemented on top of BLAS1
 Two main routines: DGEFA (Gaussian elimination, O(n³)) and DGESL (solving Ax = b from the factors, O(n²))
 Dominant operation (~97% of the time): DAXPY, y = y + α·x (sketched below)
 Roughly n³/3 + n² multiply-add pairs are executed in total, hence approximately 2n³/3 + 2n² flops
 64-bit floating-point arithmetic
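For reference, DAXPY is the level-1 BLAS update y ← y + α·x. A minimal sketch in C (illustrative only; the name daxpy_ref is made up, not the reference BLAS interface):

```c
#include <stddef.h>

/* Illustrative DAXPY: y <- y + alpha*x, the level-1 BLAS kernel that
 * dominates the original LINPACK benchmark. Each element costs one
 * multiply and one add, i.e. 2 flops per element. */
static void daxpy_ref(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```

Inside DGEFA this update is applied to successively shorter columns, which is where the n³/3 multiply-add total comes from.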

LINPACK  N=100, 100x100 system of equations. No change in code. User asked to give a timing routine called SECOND, no compiler optimizations  N=1000, 1000x1000 – user can implement any code, should provide the required accuracy: Towards Peak Performance (TPP). Driver program always uses 2n 3 /3 +2n 2  “Highly Parallel Computing” benchmark – any software, matrix size can be chosen. Used in Top500  Based on 64-bit floating point arithmetic

LINPACK  100x100 – inner loop optimization  1000x1000 – three-loop/whole program optimization  Scalable parallel program – Largest problem that can fit in memory

HPL (High Performance LINPACK)

HPL Algorithm
 2-D block-cyclic data distribution (owner computation sketched below)
 Right-looking LU factorization
 Panel factorization, with various options:
 - Crout, left-looking, or right-looking recursive variants based on matrix multiply
 - number of sub-panels
 - recursive stopping criteria
 - pivot search and broadcast by binary exchange
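Under a 2-D block-cyclic layout, the blocks of the matrix simply cycle over a P × Q process grid. A minimal sketch of the owner computation (illustrative; not HPL's internal API):

```c
/* Owner of the nb x nb block containing element (i, j) under a 2-D
 * block-cyclic distribution on a P x Q process grid: block rows cycle
 * over the P process rows, block columns over the Q process columns. */
typedef struct { int p, q; } GridCoord;

static GridCoord element_owner(int i, int j, int nb, int P, int Q)
{
    GridCoord c = { (i / nb) % P, (j / nb) % Q };
    return c;
}
```

This layout keeps both the panel factorizations and the trailing-matrix updates load-balanced as the factorization shrinks the active part of the matrix.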

HPL algorithm  Panel broadcast: -  Update of trailing matrix: - look-ahead pipeline  Validity check - should be O(1)

Top500 (  Top500 – 1993  Twice a year – June and November  Top500 gives Nmax, Rmax, N1/2, Rpeak

TOP500 list – data shown
 Manufacturer – manufacturer or vendor
 Computer – type indicated by manufacturer or vendor
 Installation Site – customer
 Location – location and country
 Year – year of installation / last major update
 Installation Type – Academic, Research, Industry, Vendor, Classified, Government
 Installation Area – e.g. Research: Energy / Industry: Finance
 # Processors – number of processors
 Rmax – maximal LINPACK performance achieved
 Rpeak – theoretical peak performance
 Nmax – problem size for achieving Rmax
 N1/2 – problem size for achieving half of Rmax
 Nworld – position within the TOP500 ranking

24th List (November 2004): The TOP 5 (Rmax and Rpeak in GFlop/s)
 1. BlueGene/L DD2 beta-System (0.7 GHz PowerPC 440), IBM – IBM/DOE, United States/2004 – 32768 processors – Rmax 70720, Rpeak 91750
 2. Columbia, SGI Altix 1.5 GHz, Voltaire Infiniband, SGI – NASA/Ames Research Center/NAS, United States/2004 – 10160 processors – Rmax 51870, Rpeak 60960
 3. Earth-Simulator, NEC – The Earth Simulator Center, Japan/2002 – 5120 processors – Rmax 35860, Rpeak 40960
 4. MareNostrum, eServer BladeCenter JS20 (PowerPC 970 2.2 GHz), Myrinet, IBM – Barcelona Supercomputer Center, Spain/2004 – 3564 processors – Rmax 20530, Rpeak 31363
 5. Thunder, Intel Itanium2 Tiger4 1.4 GHz, Quadrics, California Digital Corporation – Lawrence Livermore National Laboratory, United States/2004 – 4096 processors – Rmax 19940, Rpeak 22938
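Rpeak is pure arithmetic: processor count × clock rate × flops per cycle. A quick sanity check on the table (the flops-per-cycle figure, 4 for the PowerPC 440's dual FPU, is the commonly quoted value, stated here as an assumption):

```c
#include <stdio.h>

/* Theoretical peak in GFlop/s: processors x clock (GHz) x flops/cycle. */
static double rpeak_gflops(int procs, double ghz, int flops_per_cycle)
{
    return procs * ghz * flops_per_cycle;
}

int main(void)
{
    /* BlueGene/L beta: 32768 x 0.7 GHz x 4 = 91750 GFlop/s, matching
     * the Rpeak column above. */
    printf("%.0f GFlop/s\n", rpeak_gflops(32768, 0.7, 4));
    return 0;
}
```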

24th List: Systems in India
 Rank 267 – Tech Pacific Exports C, India/2004 – Integrity Superdome, 1.5 GHz, HPlex / 288 processors – HP
 Semiconductor Company (F), India/2003 – xSeries Cluster Xeon 2.4 GHz, Gig-E / 574 processors – IBM
 Geoscience (C), India/2004 – xSeries Xeon 3.06 GHz, Gig-E / 256 processors – IBM
 Institute of Mathematical Sciences, C.I.T Campus, India/2004 – KABRU Pentium Xeon Cluster 2.4 GHz, SCI 3D / 288 processors – IMSc-Netweb-Summation
 Geoscience (B), India/2004 – four identical BladeCenter Xeon 3.06 GHz, Gig-Ethernet systems / 252 processors each – IBM

[Charts: TOP500 breakdown by Manufacturer, Architecture, Processor Generation, and System Processor Count – figures not reproduced]

NAS Parallel Benchmarks (NPB)
 Also used for the evaluation of supercomputers
 A set of 8 programs derived from computational fluid dynamics (CFD) applications
 5 kernels, 3 pseudo-applications
 NPB 1 – original benchmarks
 NPB 2 – NAS’s MPI implementation; NPB 2.4 Class D has more work and more I/O
 NPB 3 – versions based on OpenMP, HPF, and Java
 GridNPB3 – for computational grids
 NPB 3 multi-zone – for hybrid parallelism

NPB 1.0 (March 1994)
 Defines Class A and Class B problem sizes
 “Paper and pencil” algorithmic specifications
 Generic benchmark specifications, as opposed to a fixed code such as LINPACK
 General rules for implementations – Fortran 90 or C, 64-bit arithmetic, etc.
 Sample implementations provided

Kernel Benchmarks
 EP – embarrassingly parallel (sketched below)
 MG – multigrid; regular communication
 CG – conjugate gradient; irregular long-distance communication
 FT – a 3-D PDE solved with FFTs; a rigorous test of long-distance communication
 IS – large integer sort
 Detailed rules cover:
 - a brief statement of the problem
 - the algorithm to be used
 - validation of results
 - where to insert timing calls
 - the method for generating random numbers
 - submission of results
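To give a flavor of the kernels, EP generates pairs of uniform pseudorandom numbers, converts accepted pairs to Gaussian deviates with the Marsaglia polar method, and tallies the deviates in square annuli; the only communication is a final reduction of the counts. A minimal serial sketch of that idea (it uses the C library rand() rather than the NPB linear congruential generator, so it is not a conforming implementation):

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* EP-style kernel: count Gaussian deviate pairs whose max(|X|,|Y|)
 * falls in annulus [k, k+1). Embarrassingly parallel: each process
 * would own a disjoint slice of the random stream, and the per-annulus
 * counts would be combined with one reduction at the end. */
int main(void)
{
    long counts[10] = {0};
    srand(42);
    for (long i = 0; i < 1000000; i++) {
        double x = 2.0 * rand() / RAND_MAX - 1.0;  /* uniform in (-1,1) */
        double y = 2.0 * rand() / RAND_MAX - 1.0;
        double t = x * x + y * y;
        if (t > 1.0 || t == 0.0) continue;         /* rejection step */
        double f = sqrt(-2.0 * log(t) / t);        /* polar transform */
        int k = (int)fmax(fabs(x * f), fabs(y * f));
        if (k < 10) counts[k]++;
    }
    for (int k = 0; k < 10; k++)
        printf("annulus %d: %ld\n", k, counts[k]);
    return 0;
}
```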

Pseudo-applications / synthetic CFD codes
 Benchmark 1 – perform a few iterations of the block-tridiagonal approximate factorization algorithm (BT)
 Benchmark 2 – perform a few iterations of the diagonal (scalar pentadiagonal) form of the approximate factorization algorithm (SP)
 Benchmark 3 – perform a few iterations of SSOR (LU)

Class A and Class B Sample Code [side-by-side Class A and Class B listings not reproduced]

NPB 2.0 (1995)
 MPI and Fortran 77 implementations
 2 parallel kernels (MG, FT) and 3 simulated applications (LU, SP, BT)
 Class C added – a larger problem size
 Benchmark rules classify submissions by how much of the source code was changed: 0%, up to 5%, or more than 5%

NPB 2.2 (1996), 2.4 (2002), 2.4 I/O (Jan 2003)
 EP and IS added
 FT rewritten
 NPB 2.4 – Class D added, with a rationale for the Class D sizes
 2.4 I/O – a new benchmark problem based on BT (BTIO) to test output capabilities
 An MPI implementation of the same (MPI-IO), with different options such as whether to use collective buffering