Exascale Programming Models in an Era of Big Computation and Big Data

Exascale Programming Models in an Era of Big Computation and Big Data
Barbara Chapman
Stony Brook University / University of Houston
NPC, Xi'an, October 2016

On-Going Architectural Changes
- HPC system nodes continue to grow: CORAL nodes reach 40 TFLOP/s per node
- Thermal and power constraints are now key design drivers
- Massive increase in intra-node concurrency
- Trend toward heterogeneity
- Deeper, more complex memory hierarchies
- Memory density used to grow roughly every 3 years; it now grows by a smaller amount every 4 years
- I/O is flat; off-chip signalling rates are rising slowly at best
- Off-chip bandwidth is decaying: actual bandwidth per core is dropping dramatically
- Memory per flop is dropping precipitously
Architecture-related websites:
http://en.wikipedia.org/wiki/TOP500
http://www.extremetech.com/computing/116081-darpa-summons-researchers-to-reinvent-computing

Intel: "Sea of Blocks" Compute Model
[Block diagram, (c) 2014 Intel: a CE host processor (full x86 with TLBs, SSE, a tweaked decoder, and sL1/iL1/dL1 caches) sits beside an intra-accelerator network of AUs, each with its own iL1/sL1/dL1; shared sL2/uL2 caches, an async offload engine, NLNI, a bus gasket, and a special I/O fabric attach to the standard x86 on-die fabric and memory map, with a memory controller to external DRAM & NVM and an IPM bus.]

10+ Levels of Memory, O(100M) Cores
[Hierarchy diagram, (c) 2014 Intel]
Memory levels: ALU RF, L1$/L1S, L2$/L2S, LL$/LLS, IPM, DDR, NVM, disk pool.
System hierarchy (with per-level fan-outs ranging from O(1) to O(1,000)): cores per block → blocks with shared L2 per die → dies with shared LL$/SPAD per socket → sockets with IPM per board → boards with limited DDR+NVM per chassis → chassis with large DDR+NVM per exa-machine → machines plus disk arrays.

Integration of Accelerators: CAPI and APU
- IBM's Coherent Accelerator Processor Interface (CAPI) integrates accelerators into the system architecture with a standardized protocol (coherence bus, with a CAPP unit on the POWER8 CPU and a PSL on the PCIe-attached accelerator)
- AMD's Heterogeneous System Architecture (HSA)-based APU also integrates accelerators
- NVLink, Nvidia's high-speed GPU interconnect, provides hardware coherence between CPU L2 and GPU L2 over a global memory, attaching to x86, ARM64, and POWER CPUs

HPC Applications: Requirements
- Growing complexity in applications: multidisciplinary, with increasing amounts of data
- Social networks and similar graphs have very dense connectivity; how do we minimize communication in such applications?
- Performance: must exploit features of emerging machines at all levels; APIs and/or their implementations must facilitate the expression of concurrency, help save power, use memory efficiently, exploit heterogeneity, and minimize synchronization
- Performance portability: implies not just that APIs are widely supported, but also that the same code runs well everywhere; very hard to accomplish
- Performance is less predictable in a dynamic execution environment