Programming for High Performance Computers
John M. Levesque
Director, Cray Supercomputing Center of Excellence

Outline
- Building a Petascale Computer
- Challenges for utilizing a Petascale System
  – Utilizing the Core
  – Utilizing the Socket
  – Scaling to 100,000 cores
- How one programs for the Petascale System
- Conclusion

Petascale Computer
First we need to define what we mean by a "Petascale computer."
- Google already has a Petaflop on their floor
  – Embarrassingly parallel applications
- My definition: a Petascale computer is a computer system that delivers a sustained Petaflop to several "real science" applications

A Petascale Computer Requires:
- A state-of-the-art commodity microprocessor
- An ultra-fast proprietary interconnect
- A sophisticated LWK operating system that stays out of the way of application scaling
- Efficient messaging between processors
  – MPI may not be efficient enough!!

Potential Petascale Computer
- 32,768 sockets
  – Denser circuitry results in more processors (cores) on the chip (socket)
- Each core produces 4 results per clock cycle
- Each socket contains 4 cores sharing memory
  – We expect that by the end of 2009, microprocessor technology will supply ~3 GHz sockets, each capable of delivering 16 floating point operations per clock cycle
- 32,768 * 16 * 3 = 1,572,864 GFLOPS ≈ 1.57 PFLOPS

Petascale Challenge for the Interconnect
- Connect 32,768 sockets with an interconnect that has 2-3 microsecond latency across the entire system
- Supply enough cross-section bandwidth to facilitate ALLTOALL communication across the entire system

Petascale Challenge for Programming
Use the machine as 131,072 uni-processors or as 32,768 4-way shared-memory sockets:
- MPI across all the processors
  – Hard on socket memory bandwidth and on injection bandwidth into the network
- MPI between sockets and OpenMP across the socket
  – Hybrid programming is difficult

Petascale Challenge for Software
The OS must supply the required facilities without being overloaded with daemons that steal CPU cycles and get cores out of sync.
- The notion of a Light Weight Kernel (LWK) that has only what is needed to run the application
  – No keyboard daemon, no kernel threads, no sockets, ...
- Two systems use this very successfully today: Cray's XT4 and IBM's Blue Gene

The Programming Challenge
We start with 1.5 Petaflops of peak and want to sustain > 1 Petaflop.
- Must achieve 67% of peak across the entire system
Inhibitors:
- On-socket memory bandwidth
- Scaling across 131,072 processors; or,
- Utilizing OpenMP on the socket and messaging across the system

The Programming Challenge
Inhibitors:
- On-socket memory bandwidth
  – Today we see between 5-80% of sustained performance on the core; this single-core sustained performance is the maximum we will achieve.
- Scaling across 131,072 processors
  – Today few applications scale as high as 5,000 processors.
- Utilizing OpenMP on the socket and messaging across the system
  – OpenMP must be used on a very high percentage of the application; otherwise Amdahl's law applies and the peak of the socket is degraded.

Programming for the Core
Each core produces 4 floating point results per clock cycle, but the memory can supply only 16 bytes per clock cycle.
- Best case – contiguous, aligned on 16-byte boundaries
  – 32-bit arithmetic: 4 words/cycle
  – 64-bit arithmetic: 2 words/cycle
- Worst case
  – One word every 2-4 cycles

Consider a Triad Kernel
A = B + Scalar * C
- Needs 2 loads and 1 store to produce 1 result
- How can we produce 4 results each clock cycle when each result needs 16 bytes fetched and 8 bytes stored? (A Fortran sketch follows.)
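A minimal Fortran sketch of the triad kernel (array names and the kind parameter are illustrative, not from the slides):

    ! Triad: each iteration needs 2 loads (B, C) and 1 store (A).
    ! In 64-bit arithmetic that is 24 bytes of memory traffic per result,
    ! so a memory system supplying 16 bytes/clock cannot even sustain
    ! 1 result per clock, let alone 4.
    subroutine triad(n, a, b, c, scalar)
      implicit none
      integer, intent(in)       :: n
      real(kind=8), intent(in)  :: b(n), c(n), scalar
      real(kind=8), intent(out) :: a(n)
      integer :: i
      do i = 1, n
         a(i) = b(i) + scalar * c(i)
      end do
    end subroutine triad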

Cache to the Rescue?
To ease the processor/memory mismatch, caches are introduced to facilitate the reuse of data.
- 2-3 levels of cache: L1, L2, L3
  – L1 and L2 are dedicated to a core
  – L3 is typically shared across the cores
- To improve performance, users must understand how to take advantage of cache
  – Users can improve cache utilization by blocking their algorithms so the working set fits in cache (see the sketch after this list)
  – Efficient libraries tend to be cache-friendly: ZGEMM achieves 80-90% of peak performance
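As an illustration of blocking (example code is mine, not from the slides; the block size nb is a tunable assumption), a cache-blocked matrix multiply in Fortran:

    ! Blocked matrix multiply: C = C + A*B. The block size nb is chosen
    ! so that three nb x nb tiles fit in cache; each tile is then reused
    ! many times before it is evicted.
    subroutine blocked_mm(n, nb, a, b, c)
      implicit none
      integer, intent(in)         :: n, nb
      real(kind=8), intent(in)    :: a(n,n), b(n,n)
      real(kind=8), intent(inout) :: c(n,n)
      integer :: ii, jj, kk, i, j, k
      do jj = 1, n, nb
        do kk = 1, n, nb
          do ii = 1, n, nb
            do j = jj, min(jj+nb-1, n)
              do k = kk, min(kk+nb-1, n)
                do i = ii, min(ii+nb-1, n)
                  c(i,j) = c(i,j) + a(i,k) * b(k,j)
                end do
              end do
            end do
          end do
        end do
      end do
    end subroutine blocked_mm

This is roughly the idea a tuned GEMM implementation uses, with register blocking and SSE code generation layered on top.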

Programming Challenge
Minimize loads/stores and maximize floating point operations.
- Fortran compilers have been, and are, extremely good at optimizing Fortran code
- C compilers are hindered by the use of pointers, which confuse the compiler's data dependency analysis – unless one writes "C-tran" (the sketch below illustrates the aliasing issue)
- C++ compilers completely give up
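A hedged illustration of the pointer point (example code is mine, not from the slides): in Fortran the compiler may assume that a dummy argument being assigned does not alias the other dummies, so the loop below vectorizes readily; the equivalent C routine taking plain pointers would need the restrict qualifier, or "C-tran" style coding, to give the compiler the same guarantee.

    ! Because z (which is assigned) may not legally alias x or y,
    ! the compiler can vectorize this loop with SSE instructions
    ! without any run-time aliasing checks.
    subroutine saxpby(n, alpha, x, y, z)
      implicit none
      integer, intent(in)       :: n
      real(kind=8), intent(in)  :: alpha, x(n), y(n)
      real(kind=8), intent(out) :: z(n)
      integer :: i
      do i = 1, n
         z(i) = alpha * x(i) + y(i)
      end do
    end subroutine saxpby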

Programming Challenge
- 80% of ORNL's major science applications are written in Fortran
- University students are being taught about new architectures and C, C++ and Java
- No classes teach how to write Fortran and C that takes advantage of cache and utilizes SSE instructions through the language

Why Fortran?
- Legacy codes are mostly written in Fortran
  – Compiler writers tend to develop better Fortran optimizations because of the existing code base
  – 83% of ORNL's major codes are Fortran
- Fortran allows users to relay more information about memory access to the compiler
  – Compilers can generate better optimized code from Fortran than from C, and C++ code is just awful
- Scientific programmers tend to use Fortran to get the most out of the system
  – Even large C++ frameworks use Fortran computational kernels

What about new Languages?
- Famous question: "What languages are going to be used in the year 2000?"
- Famous answer: "Don't know what it will be called; however, it will look a lot like Fortran"

Seriously
- HPF (High Performance Fortran) was a complete failure: a language was developed that was difficult to compile efficiently, and since early use was unsuccessful, programmers quit using the new language before the compilers got better.
- DARPA HPCS – three new language proposals; will they suffer from the HPF syndrome?

The Hybrid Programming Model
- OpenMP on the socket
  – Master/slave model
- MPI, CAF or UPC across the system
  – Single Program, Multiple Data (SPMD)
  – A few use Multiple Instruction, Multiple Data (MIMD)
- Co-Array Fortran and UPC greatly simplify this into a single programming model (a hybrid sketch follows)
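A minimal hybrid sketch (illustrative only; names and sizes are assumptions): MPI ranks map to sockets, OpenMP threads to the cores within each socket.

    ! Hybrid SPMD sketch: one MPI rank per socket, OpenMP slave threads
    ! inside the socket (master/slave model).
    program hybrid_sketch
      use mpi
      implicit none
      integer :: ierr, rank, provided, i, n
      real(kind=8), allocatable :: work(:)
      real(kind=8) :: local_sum, global_sum

      ! Ask for thread support so the OpenMP threads can coexist with MPI.
      call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

      n = 1000000
      allocate(work(n))
      work = 1.0d0
      local_sum = 0.0d0

      ! OpenMP across the cores of the socket.
      !$omp parallel do reduction(+:local_sum)
      do i = 1, n
         local_sum = local_sum + work(i)
      end do
      !$omp end parallel do

      ! MPI across the sockets of the system, called by the master thread.
      call MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                         MPI_SUM, MPI_COMM_WORLD, ierr)

      if (rank == 0) print *, 'global sum =', global_sum
      call MPI_Finalize(ierr)
    end program hybrid_sketch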

Shared Memory Programming
- OpenMP
  – Directives for Fortran and pragmas for C
- Co-Arrays
  – The user specifies a processor (image): A(I,J)[nproc] = B(I,J)[nproc+1] + C(I,J)
  – If nproc or nproc+1 is on the socket, the access is an ordinary memory access; if it is off the socket, it is a remote memory access (a put into A or a get of B). C always comes from local memory. (See the sketch below.)
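A slightly fuller Co-Array Fortran sketch built around the same statement (image numbering and array sizes are illustrative):

    ! Every image holds its own copy of the co-arrays a and b; the [..]
    ! co-index addresses another image's copy. c is an ordinary local array.
    program caf_sketch
      implicit none
      integer, parameter :: n = 8
      real(kind=8) :: a(n,n)[*], b(n,n)[*], c(n,n)
      integer :: me, np, neighbor

      me = this_image()
      np = num_images()
      neighbor = merge(1, me + 1, me == np)   ! wrap around to image 1

      b = real(me, kind=8)
      c = 1.0d0
      sync all        ! ensure b is defined on every image before the get

      ! Local store into a, remote get of b from the neighboring image,
      ! local load of c -- the pattern in the slide's one-line example.
      a(:,:) = b(:,:)[neighbor] + c(:,:)

      sync all
      if (me == 1) print *, 'a(1,1) on image 1 =', a(1,1)
    end program caf_sketch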

How to create a new Language
Extend an old one:
- Co-Array Fortran – an extension of Fortran
- UPC – an extension of C
This way the compiler writers only have to address the extension when generating efficient code.

The Programming Challenge
Scaling to 131,072 processors:
- MPI is a coarser-grained messaging model, requiring hand-holding between communicating processors
  – The user is protected to some degree
- Co-Array Fortran and UPC are Fortran and C extensions that facilitate low-latency "gets" and "puts" into remote memory. These two languages are commonly known as Global Address Space (GAS) languages, where the user can address all of the memory of the MPP.
  – The user must be cognizant of synchronization between processors

Conclusions
- Scientific programmers must start learning
  – how to utilize 100,000s of processors
  – how to utilize 4-8 cores per socket
- Fortran is the best language to use for
  – controlling cache usage
  – utilizing the SSE2 instructions required to obtain >1 result per clock cycle
  – working with the compiler to get the most out of the core
- GAS languages such as Co-Arrays and UPC facilitate efficient utilization of 100,000s of processors