Tuning LINPACK NxN for HP Platforms

Presentation transcript:

Tuning LINPACK NxN for HP Platforms
Hsin-Ying Lin, Piotr Luszczek
MLIB team/HEPS/SCL/TCD, Hewlett Packard Company
HiPer '01, Bremen, Germany, October 8, 2001

Why tune LINPACK NxN
- Customers use the TOP500 list as one of the criteria when purchasing machines
- HP wants to increase the number of its computers on the TOP500 list and to help demonstrate HP's commitment to high performance computing
- See

What is LINPACK NxN
- The LINPACK NxN benchmark
  - Solves a system of linear equations by some method
  - Allows vendors to choose the problem size for the benchmark
  - Measures execution time for each problem size
- The LINPACK NxN report
  - Nmax – the size of the chosen problem run on a machine
  - Rmax – the performance in Gflop/s for the chosen problem size run on the machine
  - N1/2 – the size where half of the Rmax execution rate is achieved
  - Rpeak – the theoretical peak performance in Gflop/s for the machine
- LINPACK NxN is used to rank the TOP500 fastest computers in the world
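Rmax in the report above is derived from the measured run time using the standard LINPACK operation count of 2/3*N^3 + 2*N^2 floating-point operations (factorization plus triangular solve). A minimal C sketch of that conversion follows; the problem size and time are made-up values for illustration, not measurements from these slides.

#include <stdio.h>

/* Standard LINPACK operation count: 2/3*N^3 + 2*N^2 flops. */
static double linpack_gflops(double n, double seconds)
{
    double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
    return flops / seconds / 1.0e9;    /* Gflop/s */
}

int main(void)
{
    /* Hypothetical numbers, for illustration only. */
    double n = 60000.0, seconds = 4000.0;
    printf("Rmax ~ %.1f Gflop/s at N = %.0f\n", linpack_gflops(n, seconds), n);
    return 0;
}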

TOP500 – Past, Present, and Future
- June 2000 – 47 HP systems; cut-off: Gflop/s (performance of the 500th computer)
- November 2000 – 5 HP systems; cut-off: 55.1 Gflop/s (26% increase from June 2000)
- June 2001 – 41 HP systems; cut-off: Gflop/s (23% increase from November 2000)
- November 2001 – ??? HP systems; cut-off: Gflop/s (23-36% estimated increase from June 2001)

HP list in TOP500 (June 2001)

HP's TOP500 Status and Goals
- About 30 systems missed the entry threshold of 55.1 Gflop/s by 1 Gflop/s on Nov. 1, 2000
  - Goal for Nov. 1, 2001: ensure all 64-CPU Superdome systems are listed in the TOP500
- Lack of an excellent MPI-based LINPACK NxN algorithm despite relatively good single-node LINPACK NxN performance
  - Goal for Nov. 1, 2001: develop a better scalable algorithm for multi-node systems

The Road to a Highly Scalable LINPACK NxN Algorithm
- Studied the public-domain software HPL (High Performance LINPACK benchmark)
- Q: Why HPL?
- A: Other vendors use HPL for their LINPACK NxN benchmark and show good scalability. See:

HPL (High Performance LINPACK)
- MPI implementation of the LINPACK NxN benchmark
- Algorithm keywords
  - One- and two-dimensional block-cyclic data distribution
  - Right-looking variant of the LU factorization
  - Row partial pivoting
  - Multiple look-ahead depths
  - Recursive panel factorization
- Highly tunable (matrix dimension, blocking factor, grid topology, broadcast/factorization algorithms, data alignment)
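As a rough illustration of the block-cyclic data distribution named above, the C sketch below maps a global matrix entry to its owning process on a P x Q grid with blocking factor NB. The function and field names are illustrative, not taken from the HPL sources.

typedef struct { int prow, pcol; } owner_t;

/* Which process in a P x Q grid owns global entry (i, j) when the matrix
 * is dealt out in NB x NB blocks, block-cyclically in both dimensions? */
static owner_t block_cyclic_owner(int i, int j, int nb, int p, int q)
{
    owner_t o;
    o.prow = (i / nb) % p;   /* block-row index wrapped over the P process rows    */
    o.pcol = (j / nb) % q;   /* block-column index wrapped over the Q process cols */
    return o;
}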

HPL (High Performance LINPACK)
- HPL solves a linear system of order n of the form Ax = b
- It computes the LU factorization with partial pivoting of the n-by-(n+1) matrix [A, b] = [[L, U], y]
- Since the lower triangular factor L is applied to b as the factorization progresses, the solution x is obtained by solving the upper triangular system Ux = y
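The final step described above, solving Ux = y, can be sketched as a plain back substitution. This is only an illustration (column-major storage, no blocking), not HPL's actual distributed implementation.

/* Solve U x = y for x, where U is upper triangular, stored column-major
 * with leading dimension ldu. */
static void back_substitute(int n, const double *U, int ldu,
                            const double *y, double *x)
{
    for (int i = n - 1; i >= 0; --i) {
        double s = y[i];
        for (int j = i + 1; j < n; ++j)
            s -= U[i + j * ldu] * x[j];     /* subtract already-known terms */
        x[i] = s / U[i + i * ldu];          /* divide by the diagonal entry */
    }
}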

Caveat of HPL
- The lower triangular matrix L is left unpivoted and the array of pivots is not returned.
- The array b is part of the matrix A.
- These properties imply that HPL is not general LU factorization software and cannot be used to solve multiple right-hand sides simultaneously.

Cyclic 1-D division of the matrix into 8 panels on 4 processors (P0, P1, P2, P3)
- Factor panel 0
- Update panels 1-7 using panel 0
- Factor panel 1
- Update panels 2-7 using panel 1
- Factor panel 2
- ...
- Factor panel 7
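The schedule above can be written as a small loop nest. The C sketch below just prints the factor/update order for 8 panels dealt round-robin to 4 processes; the stub routines stand in for real panel factorization and trailing-matrix updates and are not HPL code.

#include <stdio.h>

static void factor_panel(int rank, int k)
{
    printf("P%d factors panel %d\n", rank, k);
}

static void update_panel(int rank, int j, int k)
{
    printf("P%d updates panel %d using panel %d\n", rank, j, k);
}

/* Right-looking order: factor panel k, then update every later panel with it. */
static void right_looking_lu(int npanels, int nprocs)
{
    for (int k = 0; k < npanels; ++k) {
        factor_panel(k % nprocs, k);            /* owner of panel k factors it */
        for (int j = k + 1; j < npanels; ++j)
            update_panel(j % nprocs, j, k);     /* owners update their panels  */
    }
}

int main(void)
{
    right_looking_lu(8, 4);   /* 8 panels, 4 processes, as in the slide */
    return 0;
}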

Look-Ahead Algorithm (8 panels on 4 processors)
- Factor panel 0
- Update panel 1 using panel 0
- Factor panel 1; mark panel 1 as factored
- Update panel 5 using panel 0
- Update panel 5 using panel 1
- ...
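Reusing the print stubs from the previous sketch, the look-ahead reordering can be expressed as: update and factor the next panel first, so it is ready as early as possible, then finish updating the rest of the trailing panels. This is serial pseudocode of the idea only; in HPL the two parts run concurrently on different MPI processes.

/* Depth-1 look-ahead order: the next panel is updated and factored before
 * the remaining trailing panels are updated with the current panel. */
static void lu_with_lookahead(int npanels, int nprocs)
{
    factor_panel(0, 0);                                /* panel 0 needs no prior updates */
    for (int k = 0; k < npanels; ++k) {
        if (k + 1 < npanels) {
            update_panel((k + 1) % nprocs, k + 1, k);  /* critical path first            */
            factor_panel((k + 1) % nprocs, k + 1);     /* next panel is factored early   */
        }
        for (int j = k + 2; j < npanels; ++j)
            update_panel(j % nprocs, j, k);            /* remaining trailing update      */
    }
}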

Characteristics of HPL
- Most suitable for cluster systems, i.e. relatively many low-performance CPUs connected by a relatively low-speed network
- Not suitable for SMPs, since MPI incurs overhead that substantially degrades performance for a benchmark code
- When the look-ahead technique is used with MPI, additional memory must be allocated on each CPU for a communication buffer; on an SMP system such a buffer is unnecessary because memory is shared

Approach for Tuning LINPACK NxN
- Leverage the algorithms in HPL
  - Use pthreads instead of MPI on a single node
  - Use a hybrid of MPI and pthreads for multi-node (Constellation) systems: MPI across nodes and pthreads within each node
- Leverage HP MLIB's BLAS routines to improve single-CPU performance. See
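A minimal sketch of the single-node idea: within one node, the trailing-matrix update C <- C - L*U is split by column slices across pthreads, each calling a serial BLAS dgemm. It assumes a CBLAS interface is available (on the HP systems the BLAS itself would come from MLIB); the struct, names, and partitioning are illustrative, not the actual tuned code.

#include <pthread.h>
#include <cblas.h>   /* assumes a CBLAS interface to the vendor BLAS */

/* L, U, and C are slices of the same column-major array (in-place LU),
 * so a single leading dimension ld suffices for this sketch. */
typedef struct {
    int m, n, k, ld;
    const double *L;   /* m x k panel of L                   */
    const double *U;   /* k x n slice of the U block row     */
    double       *C;   /* m x n slice of the trailing matrix */
} update_slice_t;

static void *trailing_update(void *arg)
{
    update_slice_t *s = (update_slice_t *)arg;
    /* C <- C - L * U on this thread's column slice */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                s->m, s->n, s->k, -1.0, s->L, s->ld, s->U, s->ld,
                1.0, s->C, s->ld);
    return NULL;
}

/* One thread per column slice; an MPI layer (not shown) would sit above
 * this, moving panels between nodes of the Constellation. */
static void threaded_trailing_update(update_slice_t *slices, int nthreads)
{
    pthread_t tid[64];                 /* sketch assumes nthreads <= 64 */
    for (int t = 0; t < nthreads; ++t)
        pthread_create(&tid[t], NULL, trailing_update, &slices[t]);
    for (int t = 0; t < nthreads; ++t)
        pthread_join(tid[t], NULL);
}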

SD PA8600 vs. other machines
Note: smaller is better for the numbers under “Ratio”

Constellation PA8600 Performance – 1.9x, 3.8x, 3.9x (G: Gigabit Ethernet, H: Hyper Fabric)

Summary
- We believe we reached our first goal.
- We accomplished our second goal of having better scalable code for the HP Constellation system.
- A 4x32-CPU SD PA8600 could be ranked close to the TOP 100, based on the TOP500 list of June 2001.
- A 1x64-CPU SD PA8600 could be ranked within the TOP 250, based on the TOP500 list of June 2001.
- Performance per CPU of the SD PA8600 is about 1.5x, 1.9x, and 2.5x that of the IBM Power3, SGI O3000, and Sun HPC1000, respectively.