Trends and Challenges in High Performance Computing. Hai Xiang Lin, Delft Institute of Applied Mathematics, Delft University of Technology.

What is HPC? Supercomputers are computers that are typically a hundred times or more faster than a PC or workstation. High Performance Computing typically means computer applications running on parallel and distributed supercomputers in order to solve large-scale or complex models within an acceptable time (huge computing speed and huge memory storage).

Computational Science & Engineering Computational Science & Engineering is becoming increasingly important as a third paradigm for scientific research (the traditional paradigms being analytical (theory) and experimental); HPC is the driving force behind the rise of this third paradigm (although CSE is not necessarily tied to HPC).

Hunger for more computing power Tremendous increases in speed: clock speed: 10^6; parallel processing: 10^5; efficient algorithms: 10^6. Computational scientists and engineers demand ever more computing power.

Outline Architectures Software and Tools Algorithms and Applications

Evolution of supercomputers: Vector supercomputers (70s, 80s): very expensive (Cray-2 (1986): 1 Gflops). Massively parallel computers (90s): still expensive (Intel ASCI Red (1997): 1 Tflops). Distributed cluster computers (late 90s onwards): cheap (using off-the-shelf components) and easy to produce (IBM Roadrunner (2008): 1 Pflops). What's next? 1 Exaflops in 2019? (a cycle of 11 years)

Hardware trends: no more doubling of clock speed every 2~3 years

CMOS devices hitting a scaling wall. Power components: active power, passive power, gate leakage, sub-threshold leakage (source-drain leakage). Net: further improvements require structure/materials changes. [Figure: chip power density (W/cm^2) approaching the air-cooling limit.] Chips are reaching physical limits for power density, limiting our ability to continue to increase frequency, the main performance lever of the past. Source: P. Hofstee, IBM, Euro-Par 09 keynote.

Microprocessor Trends Single-thread performance is power limited; multi-core extends throughput performance; hybrid designs extend performance and efficiency further. [Figure: performance versus power for single-thread, multi-core, and hybrid designs.]

Hardware trends: moving towards multi-core and accelerators (e.g., Cell, GPU, …). Multi-core: e.g., IBM Cell BE: 1 PPE + 8 SPEs. GPU: e.g., Nvidia G200 (240 cores) and GF-100 (512 cores). A "supercomputer" is now affordable for everyone, e.g., a PC plus a 1 Tflops GPU. System sizes keep increasing: the largest supercomputer will soon have more than 1 million processors/cores (e.g., IBM Sequoia: 1.6 million Power processor cores, 1.6 Pbytes of memory and 20 Pflops (2012)). Power consumption is becoming an important metric (watts/Gflops) for (HPC) computers.

Geographical distribution of Supercomputing power (Tflops) Figure: John West, InsideHPC

HPC Cluster Directions (according to Peter Hofstee, IBM)

Software and Tools Challenges From the mid-70s to the mid-90s, data-parallel languages and the SIMD execution model were popular along with vector computers, and automatic vectorization of array-type operations became quite well developed. For MPPs and clusters, however, an efficient automatic parallelizing compiler has not been developed to this day: optimizing data distribution and automatically detecting task parallelism have turned out to be very hard problems to tackle.
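As an illustrative aside (my own sketch, not from the slides; function names are made up), the first loop below is the kind of array operation automatic vectorization handles well, while the second is a recurrence a compiler cannot vectorize without restructuring:

```c
/* Illustrative sketch. A vectorizing compiler can typically handle the
   first loop automatically, but not the second, because of the
   loop-carried dependence on a[i-1]. */
void axpy(int n, double alpha, const double *restrict x, double *restrict y)
{
    for (int i = 0; i < n; i++)      /* independent iterations: vectorizable */
        y[i] += alpha * x[i];
}

void inclusive_scan(int n, double *a)
{
    for (int i = 1; i < n; i++)      /* a[i] depends on a[i-1]: a recurrence
                                        not vectorizable as written */
        a[i] += a[i - 1];
}
```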

Software and Tools Challenges (cont.) Current situation: OpenMP works for SMP systems with a small number of processors/cores. For large systems and distributed-memory systems, data distribution and communication must be handled manually by the programmer, mostly with MPI. Programming GPU-type accelerators using CUDA, OpenCL, etc. bears some resemblance to programming vector processors in the old days: very high performance for certain types of operations, but the programmability and applicability are somewhat limited.
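A minimal sketch of the contrast (my own illustration, not from the slides; it assumes each MPI rank already holds a local block x_loc, y_loc of length n_loc, and is compiled with an MPI compiler plus OpenMP enabled): within a node one OpenMP pragma parallelizes the loop, whereas across nodes the programmer must combine partial results explicitly with MPI.

```c
#include <mpi.h>

/* Dot product of a distributed vector pair, illustrating both models. */
double dot_hybrid(const double *x_loc, const double *y_loc, int n_loc)
{
    double local = 0.0, global = 0.0;

    /* OpenMP: shared-memory parallelism within a node, one pragma suffices */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < n_loc; i++)
        local += x_loc[i] * y_loc[i];

    /* MPI: the programmer combines the per-node partial sums explicitly */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}
```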

The programming difficulty is getting worse. In contrast to the fast development of hardware, the development of parallel compilers and programming models lags behind. Moving towards larger and larger systems makes this problem even worse; further issues include heterogeneity, debugging, …

DOE report on exascale supercomputing [4]: "The shift from faster processors to multicore processors is as disruptive as the shift from vector to distributed memory supercomputers 15 years ago. That change required complete restructuring of scientific application codes, which took years of effort. The shift to multicore exascale systems will require applications to exploit million-way parallelism. This 'scalability challenge' affects all aspects of the use of HPC. It is critical that work begin today if the software ecosystem is to be ready for the arrival of exascale systems in the coming decade."

The big challenge requires a concerted international effort: IESP, the International Exascale Software Project [5].

Applications Applications which require exaflops computing power, for example ([4],[6]): climate and atmospheric modelling; astrophysics; energy research (e.g., combustion and fusion); biology (genetics, molecular dynamics, …); … Are there applications which can use 1 million processors? Parallelism is inherent in nature; serialization is a way we deal with complexity; some mathematical and computational models may have to be reconsidered.

Algorithms Algorithms with a large degree of parallelism are essential. Data locality is important for efficiency: data movement at the cache level(s), and data movement (communication) between processors/nodes.

Growing gap between processor and memory speed

Memory hierarchy: NUMA. To reduce the large delay of directly accessing main or remote memory, we need to: optimize data movement, maximizing reuse of data already in the fastest memory; and minimize data movement (communication) between 'remote' memories.
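As an illustrative sketch of "maximize reuse of data already in the fastest memory" (my own example, not from the slides; the block size is a tuning assumption), a cache-blocked matrix multiplication reuses each loaded tile many times instead of once:

```c
/* Blocked (tiled) matrix multiplication C += A*B for n x n row-major
   matrices. BLOCK is chosen so that three BLOCK x BLOCK tiles fit in
   cache, so each tile loaded from memory is reused BLOCK times. */
#define BLOCK 64

void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                /* work on one tile: data stays in cache while it is reused */
                for (int i = ii; i < ii + BLOCK && i < n; i++)
                    for (int k = kk; k < kk + BLOCK && k < n; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + BLOCK && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```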

Scale change requires change in algorithms. It is well known that an algorithm with a higher degree of parallelism is sometimes preferred over an 'optimal' algorithm (in the sense of number of operations). In order to reduce data movement, we need to consider restructuring existing algorithms.
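A small illustration of trading extra operations for parallelism (my own sketch, not from the slides): the sequential prefix sum shown earlier uses n-1 additions but has no parallelism, while the variant below performs roughly n*log2(n) additions yet finishes in log2(n) parallel steps, so it is often preferred on a parallel machine despite doing more work.

```c
#include <string.h>

/* Step-efficient (Hillis-Steele style) inclusive prefix sum: more total
   additions than the sequential recurrence, but every addition within a
   step is independent, so the depth is only log2(n) parallel steps. */
void prefix_sum_parallel(int n, double *a, double *tmp)
{
    for (int stride = 1; stride < n; stride *= 2) {
        #pragma omp parallel for             /* all iterations independent */
        for (int i = 0; i < n; i++)
            tmp[i] = (i >= stride) ? a[i] + a[i - stride] : a[i];
        memcpy(a, tmp, n * sizeof(double));  /* simple double-buffer swap */
    }
}
```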

An example: Krylov iterative methods. James Demmel et al., "Avoiding Communication in Sparse Matrix Computations", Proc. IPDPS, April 2008. In each iteration of a Krylov method such as CG or GMRES, an SpMV (sparse matrix-vector multiplication) is typically computed: y ← y + A x, where A is a sparse matrix; for each a_ij ≠ 0, y_i ← y_i + a_ij * x_j. SpMV has low computational intensity: each a_ij is used only once, with no reuse at all.
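A minimal SpMV sketch (my own illustration; the CSR array names are assumptions) makes the low computational intensity visible: each nonzero is loaded once, used in a single multiply-add, and never touched again.

```c
/* y <- y + A*x with A stored in CSR format (row_ptr, col_idx, val).
   Each nonzero value is read once and used for exactly one multiply-add,
   so the kernel is memory-bandwidth bound. */
void spmv_csr(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            y[i] += val[k] * x[col_idx[k]];
}
```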

An example: Krylov iterative methods (cont.). Consider the operation across a number of iterations, where the "matrix powers kernel" [x, Ax, A^2 x, …, A^k x] is computed. Computing all these terms at the same time minimizes the data movement of A (at the cost of some redundant work). Speedups of up to 7x, and 22x across the grid, are reported.
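To make the matrix powers kernel concrete, here is a naive sketch (my own illustration, not Demmel et al.'s communication-avoiding algorithm) that builds the basis by k separate SpMV sweeps; the communication-avoiding reformulation computes the same vectors while reading A, and exchanging ghost data in the distributed case, far fewer times, at the price of some redundant work.

```c
/* Naive matrix powers kernel: V[0] = x, V[j] = A * V[j-1] for j = 1..k,
   with A in CSR format. Written this way, A is streamed from memory k
   times; the communication-avoiding variant blocks these sweeps. */
void matrix_powers_naive(int n, int k,
                         const int *row_ptr, const int *col_idx,
                         const double *val,
                         double **V /* k+1 vectors of length n */)
{
    for (int j = 1; j <= k; j++)
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int p = row_ptr[i]; p < row_ptr[i + 1]; p++)
                s += val[p] * V[j - 1][col_idx[p]];
            V[j][i] = s;
        }
}
```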

Example: generating parallel operations by graph transformations. [Lin2001] A Unifying Graph Model for Designing Parallel Algorithms for Tridiagonal Systems, Parallel Computing, Vol. 27, 2001. [Lin2004] Graph Transformation and Designing Parallel Sparse Matrix Algorithms beyond Data Dependence Analysis, Scientific Programming, Vol. 12, 2004. This may yet be a step too far; the first goal should be an automatic parallelizing compiler (detecting parallelism and optimizing data locality).