Performance Characteristics of a Cosmology Package on Leading HPC Architectures
Leonid Oliker, Julian Borrill, Jonathan Carter
Lawrence Berkeley National Laboratory

Overview
- Superscalar cache-based architectures dominate the HPC market
- Leading architectures are commodity-based SMPs, due to their generality and the perception of cost effectiveness
- The growing gap between peak and sustained performance is well known in scientific computing
- Modern parallel vector systems may bridge this gap for many important applications
- In April 2002, the Earth Simulator (ES) became operational:
  - Peak ES performance > all DOE and DOD systems combined
  - Demonstrated high sustained performance on demanding scientific applications
- Conducting an evaluation study of scientific applications on modern vector systems
  - 09/2003: MOU between the ES and NERSC was completed
  - First visit to the ES center Dec 2003, second visit Oct 2004 (no remote access)
  - First international team to conduct a performance evaluation study at the ES
- Examining the best mapping between demanding applications and leading HPC systems: one size does not fit all

Vector Paradigm
- High memory bandwidth: allows the system to effectively feed the ALUs (high byte-to-flop ratio)
- Flexible memory addressing modes: support fine-grained strided and irregular data access
- Vector registers: hide memory latency via deep pipelining of memory loads/stores
- Vector ISA: a single instruction specifies a large number of identical operations
- Vector architectures allow for:
  - Reduced control complexity
  - Efficient utilization of a large number of computational resources
  - Potential for automatic discovery of parallelism
- However, they are most effective when sufficient regularity is discoverable in the program structure, and performance suffers even if a small percentage of the code is non-vectorizable (Amdahl's Law); a simple model of this limit is sketched below.
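Not from the slides: a minimal sketch of the Amdahl's Law point above. If a fraction f of the work vectorizes and the vector unit runs that fraction v times faster than scalar code, the non-vectorizable remainder caps the overall speedup. The helper name and the example numbers are illustrative.

#include <stdio.h>

/* Amdahl-style bound on vector speedup:
 * f = fraction of the work that vectorizes,
 * v = speedup of the vectorized portion (vector vs. scalar rate). */
static double vector_speedup(double f, double v)
{
    return 1.0 / ((1.0 - f) + f / v);
}

int main(void)
{
    /* Even with a 30x faster vector unit, 10% scalar code caps the
     * overall speedup below 8x. */
    printf("f=0.90, v=30: speedup = %.2f\n", vector_speedup(0.90, 30.0));
    printf("f=0.99, v=30: speedup = %.2f\n", vector_speedup(0.99, 30.0));
    return 0;
}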

Architectural Comparison

  Node Type   Where   Network Topology
  Power3      NERSC   Fat-tree
  Power4      ORNL    Fat-tree
  Altix       ORNL    Fat-tree
  ES          ESC     Crossbar
  X1          ORNL    2D-torus

  (Per-system columns also compared: CPUs/node, clock (MHz), peak (GFlop/s), memory BW (GB/s), peak byte/flop, network BW (GB/s/P), bisection BW (byte/flop), and MPI latency (usec).)

- Custom vector architectures have:
  - High memory bandwidth relative to peak
  - Superior interconnect: latency, point-to-point, and bisection bandwidth
- Another key balance point is I/O performance:
  - Seaborg I/O: 16 GPFS servers, each with 32 GB main memory (for caching & metadata); I/O uses the switch fabric, sharing bandwidth with message-passing traffic
  - ES I/O: each group of 16 nodes has a pool of RAID disks attached with a Fibre Channel switch (each node has a separate filesystem)

Previous ES visit
- Tremendous potential of vector architectures: 4 codes running faster than ever before
- Vector systems allow resolution not possible with scalar architectures (regardless of # procs)
- Opportunity to perform scientific runs at unprecedented scale
- Evaluation codes contain sufficient regularity in computation for high vector performance
- However, none of the tested codes contained significant I/O requirements

  % of peak (P = 64):
  Code      Pwr3   Pwr4   Altix   ES    X1
  LBMHD     7%     5%     11%     58%   37%
  CACTUS    6%     11%    7%      34%   6%
  GTC       9%     6%     5%      20%   11%
  PARATEC   57%    33%    54%     58%   20%

  (The slide also tabulated the ES speedup, at P = max available, versus Pwr3, Pwr4, Altix, and X1, together with averages across the codes.)

The Cosmic Microwave Background
- The CMB is a snapshot of the Universe when it first became neutral, 400,000 years after the Big Bang.
- After the Big Bang, the expansion of space cooled the Universe sufficiently for charged electrons and protons to combine.
  - Cosmic: primordial photons filling all of space.
  - Microwave: redshifted by the expansion of the Universe from 3000 K to 3 K.
  - Background: coming from "behind" all astrophysical sources.

CMB Science
- The CMB is a unique probe of the very early Universe.
- Tiny fluctuations in its temperature & polarization encode:
  - the fundamental parameters of cosmology: Universe geometry, expansion rate, number of neutrino species, ionization history, dark matter, cosmological constant
  - ultra-high-energy physics beyond the Standard Model

CMB Data Analysis
CMB analysis moves from the time domain (observations, O(10^12)), to the pixel domain (maps, O(10^8)), to the multipole domain (power spectra, O(10^4)), calculating the compressed data and their reduced error bars at each step.

MADCAP: Performance
- Porting: ScaLAPACK, plus a rewrite of the Legendre polynomial recursion so that large batches are computed in the inner loop (a sketch of this loop restructuring follows below)
- Original ES visit: only partially ported due to the code's requirement of a global file system
  - Could not meet minimum parallelization and vectorization thresholds for the ES
- All systems sustain a relatively low % of peak considering MADCAP's BLAS3 ops
- Detailed analysis presented at HiPC 2004
- Further work performed for MADbench to: reduce I/O, remove system calls, and remove global file system requirements
- New results collected from the recent ES visit, October 2004

  Gflops/P and % of peak:
  P     Power3            Power4            ES                X1
                          1.5    29%        4.1    32%        2.2    27%
                          0.81   16%        1.9    23%        2.0    16%
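Not from the slides: a minimal sketch of the kind of loop restructuring described above, assuming the standard Legendre recurrence l*P_l(x) = (2l-1)*x*P_{l-1}(x) - (l-1)*P_{l-2}(x). The recursion over l is serial, so computing a large batch of arguments x in the inner loop keeps the vectorizable dimension long; the function name and array layout are illustrative.

#include <stddef.h>

/* Batched Legendre recursion: for each multipole l = 0..lmax, evaluate
 * P_l at a batch of arguments x[0..n-1].  pl is (lmax+1) x n, row-major:
 * pl[l*n + i] = P_l(x[i]). */
void legendre_batch(const double *x, size_t n, int lmax, double *pl)
{
    for (size_t i = 0; i < n; i++) {          /* P_0 and P_1 */
        pl[0 * n + i] = 1.0;
        if (lmax >= 1)
            pl[1 * n + i] = x[i];
    }
    for (int l = 2; l <= lmax; l++) {
        const double a = (2.0 * l - 1.0) / l;
        const double b = (l - 1.0) / l;
        /* inner loop over the batch: long, unit-stride, vectorizes well */
        for (size_t i = 0; i < n; i++)
            pl[l * n + i] = a * x[i] * pl[(l - 1) * n + i]
                          - b * pl[(l - 2) * n + i];
    }
}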

IPM Overview
Integrated Performance Monitoring:
- portable, lightweight, scalable profiling
- fast hash method
- profiles MPI topology
- profiles code regions
- open source

Code regions are delimited with MPI_Pcontrol calls:

  MPI_Pcontrol(1, "W");
  ... code ...
  MPI_Pcontrol(-1, "W");

Sample report (abridged):

  ###########################################
  # IPMv0.7 :: csnode tasks ES/ESOS
  # madbench.x (completed) 10/27/04/14:45:56
  # ...
  ###############################################
  # W
  # call          [time]   %mpi   %wall
  # MPI_Reduce    2.395e
  # MPI_Recv      9.625e
  # MPI_Send      2.708e
  # MPI_Testall   7.310e
  # MPI_Isend     2.597e
  ###############################################
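Not from the slides: a minimal, compilable example of the region instrumentation shown above, using IPM's convention that MPI_Pcontrol(1, "label") and MPI_Pcontrol(-1, "label") open and close a named region. The surrounding program is purely illustrative.

#include <mpi.h>
#include <stdio.h>

/* IPM treats MPI_Pcontrol(1, "label") / MPI_Pcontrol(-1, "label") as the
 * start / end of a named code region and reports MPI time, wall time,
 * and call counts per region; no recompilation is needed beyond linking
 * against the IPM profiling library. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Pcontrol(1, "W");               /* begin region "W" */

    double local = (double)rank, sum = 0.0;   /* illustrative work */
    MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Pcontrol(-1, "W");              /* end region "W" */

    if (rank == 0)
        printf("sum over %d ranks = %g\n", size, sum);

    MPI_Finalize();
    return 0;
}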

MADbench
- Is a lightweight version of the MADCAP maximum-likelihood CMB power spectrum estimation code.
- Retains the operational complexity & integrated system requirements of the full science code.
- Has three basic steps: dSdC, invD & W.
- Out-of-core calculation: holds approximately 3 of the 50 matrices in memory.
- Is used for:
  - computer & file-system procurements.
  - realistic scientific code benchmarking and optimization.
  - architectural comparisons.

dSdC
- This step generates a set of N_b dense, symmetric N_p x N_p signal correlation derivative matrices dSdC_b by Legendre polynomial recursion.
- Each matrix is block-cyclic distributed over the 2D processor array with blocksize B.
- As each matrix is calculated, each processor writes its subset of the matrix elements to a unique file (see the I/O sketch below).
- No inter-processor communication is required.
- Flops: O(N_p^2). Disk: 8 N_b N_p^2 bytes (primarily writing).
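Not from the slides: a minimal sketch of the per-processor, unique-file write pattern described above, assuming each rank dumps its local block-cyclic panel of dSdC_b with ordinary POSIX I/O. The file naming scheme and the panel setup in main are hypothetical.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Write this rank's local piece of matrix number `bin` to its own file.
 * One file per (rank, bin) means no locking and no shared-file
 * contention; only per-disk write bandwidth matters. */
static void write_local_panel(const double *local, size_t nlocal, int bin)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char fname[256];                              /* hypothetical naming */
    snprintf(fname, sizeof(fname), "dSdC_b%03d_p%05d.dat", bin, rank);

    FILE *fp = fopen(fname, "wb");
    if (!fp) { perror(fname); MPI_Abort(MPI_COMM_WORLD, 1); }
    fwrite(local, sizeof(double), nlocal, fp);    /* 8 bytes per element */
    fclose(fp);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    size_t nlocal = 1024;                         /* illustrative panel size */
    double *panel = malloc(nlocal * sizeof(double));
    for (size_t i = 0; i < nlocal; i++)
        panel[i] = (double)i;                     /* stand-in for dSdC_b data */

    write_local_panel(panel, nlocal, 0);          /* matrix (bin) 0 */

    free(panel);
    MPI_Finalize();
    return 0;
}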

invD
- This step generates the data correlation matrix D and inverts it.
- The dSdC_b matrices are read from disk one at a time and progressively accumulated to build the signal correlation matrix S.
- A diagonal white noise correlation matrix N is added to S to give the data correlation matrix D, which is inverted using ScaLAPACK to give D^-1 (see the sketch below).
- Each processor writes its subset of the D^-1 matrix elements to a unique file.
- Flops: O(N_p^3). Disk: 8 N_b N_p^2 bytes (primarily reading).
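Not from the slides: a sketch of the ScaLAPACK inversion step, assuming D is symmetric positive definite so a Cholesky factorization plus triangular-factor inversion (pdpotrf / pdpotri) can be used; the slides do not say which ScaLAPACK routines MADbench actually calls. The fragment assumes the local panel `d` and descriptor `descD` have already been set up on a BLACS grid.

/* ScaLAPACK routines are Fortran; declare the C-callable symbols. */
extern void pdpotrf_(const char *uplo, const int *n, double *a,
                     const int *ia, const int *ja, const int *desca,
                     int *info);
extern void pdpotri_(const char *uplo, const int *n, double *a,
                     const int *ia, const int *ja, const int *desca,
                     int *info);

/* Invert the block-cyclic distributed, symmetric positive-definite
 * matrix D (local panel `d`, descriptor `descD`, global size np) in
 * place: Cholesky factorization followed by inversion from the factor. */
static int invert_D(double *d, int *descD, int np)
{
    const int ione = 1;
    int info = 0;

    pdpotrf_("L", &np, d, &ione, &ione, descD, &info);  /* D = L * L^T */
    if (info != 0) return info;

    pdpotri_("L", &np, d, &ione, &ione, descD, &info);  /* D := D^-1   */
    return info;
}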

W
- This step multiplies each dSdC_b matrix by D^-1 to form W_b and derives a Newton-Raphson iterative step from this.
- Since they are independent, these matrix multiplications can be carried out gang-parallel across N_g gangs of processors (see the sketch below).
- Each dSdC_b matrix is read in by all processors and then redistributed to the target gang.
- When all gangs have been given a matrix, they all perform their multiplications simultaneously.
- Flops: O(N_p^3). Disk: 8 N_b N_p^2 bytes (primarily reading).
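Not from the slides: a sketch of the gang-parallel organization described above, assuming gangs are formed by splitting MPI_COMM_WORLD and that each gang then performs its own distributed multiply with pdgemm on its own BLACS grid. The redistribution of dSdC_b to the target gang and the Newton-Raphson step are omitted; function names are illustrative.

#include <mpi.h>

extern void pdgemm_(const char *transa, const char *transb,
                    const int *m, const int *n, const int *k,
                    const double *alpha,
                    const double *a, const int *ia, const int *ja, const int *desca,
                    const double *b, const int *ib, const int *jb, const int *descb,
                    const double *beta,
                    double *c, const int *ic, const int *jc, const int *descc);

/* Split the processors into ngangs gangs (assumes size % ngangs == 0);
 * each gang gets its own communicator and, in the real code, its own
 * BLACS context and block-cyclic distribution of D^-1 and of its dSdC_b. */
MPI_Comm make_gang_comm(int ngangs)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int gang = rank / (size / ngangs);     /* contiguous ranks per gang */
    MPI_Comm gang_comm;
    MPI_Comm_split(MPI_COMM_WORLD, gang, rank, &gang_comm);
    return gang_comm;
}

/* Within one gang: W_b = D^-1 * dSdC_b via a distributed matrix multiply.
 * dinv, dsdc, w are local panels; the descriptors live on the gang's
 * own BLACS grid; np is the global matrix dimension N_p. */
void form_W(const double *dinv, const int *descDinv,
            const double *dsdc, const int *descS,
            double *w, const int *descW, int np)
{
    const int ione = 1;
    const double one = 1.0, zero = 0.0;
    pdgemm_("N", "N", &np, &np, &np, &one,
            dinv, &ione, &ione, descDinv,
            dsdc, &ione, &ione, descS,
            &zero, w, &ione, &ione, descW);
}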

Parameters
- N_p: number of pixels (matrix size).
- N_b: number of bins (matrix count).
- N_g: number of gangs of processors.
- B: ScaLAPACK blocksize.
- MOD_IO: I/O concurrency control (only 1 in MOD_IO processors does I/O simultaneously).
- Running on P processors requires:
  - 3 x 8 x N_p^2 bytes of memory per gang
  - N_b x 8 x N_p^2 bytes & N_b x P inodes of disk
  - N_b a multiple of N_g to load-balance the gangs.
- B & MOD_IO are architecture-specific optimizations (a sizing helper is sketched below).
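Not from the slides: a tiny helper that applies the sizing rules above (3 resident 8-byte matrices per gang; N_b matrices and N_b x P files on disk). The example values of N_p, N_b, and P are illustrative.

#include <stdio.h>

/* Memory and disk footprint from the MADbench sizing rules:
 *   memory per gang : 3 x 8 bytes x Np^2
 *   disk            : Nb x 8 bytes x Np^2, in Nb x P files (inodes)  */
int main(void)
{
    const long long Np = 25000;   /* pixels (matrix dimension), illustrative */
    const long long Nb = 50;      /* bins (matrix count), illustrative       */
    const long long P  = 1024;    /* processors, illustrative                */

    long long mem_per_gang = 3LL * 8 * Np * Np;
    long long disk_bytes   = Nb  * 8 * Np * Np;
    long long inodes       = Nb * P;

    printf("memory per gang : %.1f GB\n", mem_per_gang / 1e9);
    printf("disk            : %.1f GB in %lld files\n",
           disk_bytes / 1e9, inodes);
    return 0;
}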

dSdC performance
- ES shows constant I/O performance (independent disks)
  - Significantly faster computation (30X) due to high memory bandwidth
  - Overall only 2.6X faster than Power3 due to I/O overhead
- Power3 has faster write I/O until GPFS contention at P=1024

invD performance
- I/O remains relatively constant, while MPI overhead and computation grow
- Seaborg I/O reads faster than the ES
- Overall the ES is only 2.3X faster

W performance
- Multi-gang runs significantly reduce MPI overhead (4.8X on ES, 3.3X on Seaborg)
- MPI and CALC grow with the number of processors
- I/O is a trivial part of the W calculation
- Overall the ES is 7X faster

Performance overview
- Overall the ES is 5.6X faster and achieves a slightly higher % of peak compared with Seaborg for P=1024
- For P=256 Seaborg shows a higher % of peak, due to the relative I/O vs. peak flop performance
- Although the I/O cost remains relatively high, both systems achieve over 50% of peak

Overview
- The new version of MADbench successfully reduced I/O overhead and removed the global file system requirement
  - Allowed ES runs up to 1024 processors, achieving over 50% of peak
  - Compared with only 23% of peak on 64 processors from the first visit
- Results show that I/O has more effect on the ES than on Seaborg, due to the ratio between I/O performance and peak ALU speed
- Demonstrated IPM's capability to measure MPI overhead on a variety of architectures without the need to recompile, at a trivial runtime overhead (1-2%)
- Continuing the study of the complex interplay between architecture, interconnect, and I/O
  - Currently performing experiments on Columbia and Phoenix
- MADbench and IPM are being prepared for public distribution
- Future CMB analysis will require sparse methods due to the size of the data sets, potentially at odds with vector architectures