NVIDIA’s Extreme-Scale Computing Project

Presentation transcript:

Echelon: NVIDIA’s Extreme-Scale Computing Project

Echelon Team

System Sketch
- Processor Chip (PC): SMs SM0–SM127 (each with cores C0–C7 and L0 caches), latency cores LC0–LC7, L2 SRAM banks L2_0–L2_1023, NoC, memory controllers (MC), and network interface (NIC)
- Node 0 (N0): processor chip, DRAM cubes, NV RAM; 20 TF, 1.6 TB/s, 256 GB
- Module 0 (M0): nodes N0–N7; 160 TF, 12.8 TB/s, 2 TB
- Cabinet 0 (C0): modules M0–M15 plus a high-radix router module (RM); 2.6 PF, 205 TB/s, 32 TB
- Echelon System: cabinets C0–CN on a Dragonfly interconnect (optical fiber)
- Software: Self-Aware OS, Self-Aware Runtime, Locality-Aware Compiler & Autotuner
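
The per-level figures compose multiplicatively up this hierarchy. As a sanity check, here is a small host-side sketch (plain C++, also compiles under nvcc) that re-derives the module and cabinet aggregates; only the node figures and the counts (8 nodes/module, 16 modules/cabinet) come from the slide, the rest is arithmetic.

```cpp
// echelon_hierarchy.cpp -- re-derive the module/cabinet aggregates from
// the per-node figures on the "System Sketch" slide.
#include <cstdio>

int main() {
    // Node (N0), from the slide: 20 TF, 1.6 TB/s DRAM BW, 256 GB
    const double node_tf = 20.0, node_tbs = 1.6, node_gb = 256.0;
    const int nodes_per_module    = 8;   // N0..N7
    const int modules_per_cabinet = 16;  // M0..M15

    double module_tf  = node_tf  * nodes_per_module;          // 160 TF
    double module_tbs = node_tbs * nodes_per_module;          // 12.8 TB/s
    double module_tb  = node_gb  * nodes_per_module / 1024.0; // 2 TB

    double cabinet_pf  = module_tf  * modules_per_cabinet / 1000.0; // 2.56 PF (slide rounds to 2.6)
    double cabinet_tbs = module_tbs * modules_per_cabinet;          // 204.8 TB/s (slide: 205)
    double cabinet_tb  = module_tb  * modules_per_cabinet;          // 32 TB

    std::printf("Module : %.0f TF, %.1f TB/s, %.0f TB\n", module_tf, module_tbs, module_tb);
    std::printf("Cabinet: %.2f PF, %.1f TB/s, %.0f TB\n", cabinet_pf, cabinet_tbs, cabinet_tb);
    return 0;
}
```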

Power is THE Problem
1. Data Movement Dominates Power
2. Optimize the Storage Hierarchy
3. Tailor Memory to the Application

The High Cost of Data Movement
Fetching operands costs more than computing on them. Approximate energies at 28 nm, on a 20 mm die:
- 64-bit DP operation: 20 pJ
- 256-bit access to an 8 kB SRAM: 50 pJ
- 256-bit buses, increasing on-chip distance: 26 pJ, 256 pJ, 1 nJ
- Efficient off-chip link: 500 pJ
- DRAM Rd/Wr: 16 nJ
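
To put these figures in perspective, a short host-side sketch that expresses each kind of operand fetch as a multiple of the 64-bit DP operation it feeds; only the energies from the slide are used.

```cpp
// data_movement_cost.cpp -- operand-fetch energy relative to compute,
// using the 28 nm figures from the slide.
#include <cstdio>

int main() {
    const double dp_op_pj     = 20.0;     // 64-bit DP operation
    const double sram_256b_pj = 50.0;     // 256-bit access, 8 kB SRAM
    const double offchip_pj   = 500.0;    // efficient off-chip link
    const double dram_rw_pj   = 16000.0;  // DRAM Rd/Wr (16 nJ)

    std::printf("8 kB SRAM fetch : %5.1fx one DP op\n", sram_256b_pj / dp_op_pj); //   2.5x
    std::printf("Off-chip link   : %5.1fx one DP op\n", offchip_pj   / dp_op_pj); //  25.0x
    std::printf("DRAM Rd/Wr      : %5.1fx one DP op\n", dram_rw_pj   / dp_op_pj); // 800.0x
    return 0;
}
```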

Some Applications Have Hierarchical Reuse

Applications with Hierarchical Reuse Want a Deep Storage Hierarchy
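
Tiled (blocked) matrix multiply is the textbook case of hierarchical reuse: each level of blocking re-reads operands from a smaller, closer memory. Below is a minimal CUDA sketch of one such level, staging tiles in shared memory; the tile size and names are illustrative, and it assumes the matrix dimension is a multiple of the tile size.

```cpp
// tiled_matmul.cu -- one level of blocking: each 32x32 tile of A and B is
// staged in shared memory and reused 32 times before being re-fetched,
// cutting DRAM traffic accordingly. Registers, on-chip SRAM, and stacked
// DRAM allow further levels of the same blocking.
#include <cuda_runtime.h>

constexpr int TILE = 32;  // tile edge (illustrative)

__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n; t += TILE) {
        // Stage one tile of each operand into on-chip shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + t + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
        __syncthreads();

        // Each staged element is reused TILE times from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```

Launched with a (TILE, TILE) thread block and an (n/TILE, n/TILE) grid, every value fetched from DRAM is used TILE times out of shared memory.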

Some Applications Have Plateaus in Their Working Sets

Applications with Plateaus Want a Shallow Storage Hierarchy
[Diagram: 16 processors (P), each with a private L1, attached directly to the NoC]
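
At the other extreme from blocked codes, a pure streaming kernel touches each element once; its working-set curve plateaus almost immediately, so cache levels between the L1 and DRAM capture no reuse and only add latency and energy. A minimal CUDA sketch (a SAXPY-style kernel, illustrative):

```cpp
// stream_saxpy.cu -- no temporal reuse: one read of x and one
// read-modify-write of y per element, so cache capacity beyond the
// first working-set plateau is wasted.
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```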

Configurable Memory Can Do Both At the Same Time
- Flat hierarchy for large working sets
- Deep hierarchy for reuse
- “Shared” memory for explicit management
- Cache memory for unpredictable sharing
[Diagram: processors (P) with private L1s plus configurable SRAM banks on the NoC]
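
Today's GPUs already expose a coarse form of this idea: the same on-chip SRAM is split between program-managed shared memory and a hardware-managed L1 cache, and the preferred split can be stated per kernel. A sketch using the CUDA runtime; the kernel names and bodies are placeholders, and the cache-config call is a hint the hardware may only approximate.

```cpp
// configurable_sram.cu -- one physical SRAM, two roles: explicitly managed
// "shared" memory for predictable reuse, hardware cache for irregular access.
#include <cuda_runtime.h>

__global__ void blocked_kernel() {
    __shared__ float tile[32][32];          // explicitly managed working set
    tile[threadIdx.y][threadIdx.x] = 0.0f;  // ... stage, compute, write back ...
}

__global__ void irregular_kernel() {
    // Unpredictable sharing: rely on the L1 cache instead of shared memory.
}

int main() {
    // Bias the SRAM split toward shared memory for the blocked kernel...
    cudaFuncSetCacheConfig(blocked_kernel, cudaFuncCachePreferShared);
    // ...and toward L1 cache for the irregular one.
    cudaFuncSetCacheConfig(irregular_kernel, cudaFuncCachePreferL1);
    return 0;
}
```

The slide's point goes further: in the proposed design the SRAM banks can serve both roles at the same time rather than as a per-kernel either/or.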

Configurable Memory Reduces Distance and Energy
[Diagram: four tiles, each pairing a processor (P), its L1, and a local SRAM bank with a router]

An NVIDIA ExaScale Machine

Lane – 4 DFMAs, 20 GFLOPS

SM – 8 lanes – 160 GFLOPS
[Diagram: 8 processor lanes (P) sharing an L1$ through a switch]

Chip – 128 SMs – 20.48 TFLOPS + 8 Latency Processors
- 128 SMs, 160 GFLOPS each
- 8 latency processors (LP)
- 1024 SRAM banks, 256 KB each (256 MB total)
- NoC connecting SMs, LPs, SRAM banks, memory controllers (MC), and the network interface (NI)
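
The peak-FLOPS figures compose from the lane upward. The ~2.5 GHz clock below is not stated on the slides; it is simply what 20 GFLOPS from 4 double-precision FMA units implies. A small host-side sketch of the arithmetic:

```cpp
// chip_flops.cpp -- peak throughput implied by the lane/SM/chip slides.
#include <cstdio>

int main() {
    const int    dfma_per_lane = 4;
    const double flops_per_fma = 2.0;   // multiply + add
    const double clock_ghz     = 2.5;   // implied: 20 GFLOPS / (4 * 2 flops)
    const int    lanes_per_sm  = 8;
    const int    sms_per_chip  = 128;

    double lane_gf = dfma_per_lane * flops_per_fma * clock_ghz;  // 20 GFLOPS
    double sm_gf   = lane_gf * lanes_per_sm;                     // 160 GFLOPS
    double chip_tf = sm_gf * sms_per_chip / 1000.0;              // 20.48 TFLOPS
    double sram_mb = 1024 * 256.0 / 1024.0;                      // 256 MB on-chip SRAM

    std::printf("lane %.0f GF, SM %.0f GF, chip %.2f TF, %.0f MB SRAM\n",
                lane_gf, sm_gf, chip_tf, sram_mb);
    return 0;
}
```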

Node MCM – 20 TF + 256 GB
- GPU chip: 20 TF DP, 256 MB on-chip SRAM
- DRAM stacks on the MCM: 1.4 TB/s DRAM bandwidth
- NV memory
- 150 GB/s network bandwidth
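
The telling numbers here are the balance ratios these imply: bytes of bandwidth per flop, a measure of how much locality the software and the on-chip hierarchy must supply. A small host-side sketch of that arithmetic, using only the figures on the slide:

```cpp
// node_balance.cpp -- bandwidth-per-flop ratios implied by the node slide.
#include <cstdio>

int main() {
    const double node_tflops = 20.0;    // peak DP
    const double dram_tbs    = 1.4;     // TB/s to the DRAM stacks
    const double net_gbs     = 150.0;   // GB/s to the network

    double dram_b_per_flop = (dram_tbs * 1e12) / (node_tflops * 1e12);  // 0.07
    double net_b_per_flop  = (net_gbs  * 1e9)  / (node_tflops * 1e12);  // 0.0075

    std::printf("DRAM    : %.3f bytes/flop\n", dram_b_per_flop);
    std::printf("Network : %.4f bytes/flop\n", net_b_per_flop);
    return 0;
}
```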

Cabinet – 128 Nodes – 2.56 PF – 38 kW
- 32 modules, 4 nodes/module
- Central router module(s)
- Dragonfly interconnect

System – to ExaScale and Beyond
- Dragonfly interconnect
- 400 cabinets is ~1 EF and ~15 MW
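
A quick check of the system-level arithmetic, using the cabinet figures from the previous slide:

```cpp
// system_scale.cpp -- cabinet-to-system arithmetic from the slides.
#include <cstdio>

int main() {
    const double cabinet_pf = 2.56;   // PFLOPS per cabinet
    const double cabinet_kw = 38.0;   // kW per cabinet
    const int    cabinets   = 400;

    double system_ef = cabinet_pf * cabinets / 1000.0;           // ~1.02 EF
    double system_mw = cabinet_kw * cabinets / 1000.0;           // ~15.2 MW
    double gf_per_w  = (cabinet_pf * 1e6) / (cabinet_kw * 1e3);  // ~67 GFLOPS/W

    std::printf("%.2f EF at %.1f MW (~%.0f GFLOPS/W)\n", system_ef, system_mw, gf_per_w);
    return 0;
}
```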

CONCLUSION

GPU Computing is the Future
1. GPU Computing is #1 Today: on the Top500, and dominant on the Green500
2. GPU Computing Enables ExaScale at reasonable power
3. The GPU is the Computer: a general-purpose computing engine, not just an accelerator
4. The Real Challenge is Software