Rev PA1 1 Exascale Node Model
Following up May 20th DMD discussion. Updated June 13th.
Sébastien Rumley, Robert Hendry, Dave Resnick, Anthony Lentine

Rev PA1 2 Exascale compute node architecture

- Node organized around a single computing chip
  – Interconnecting multiple sockets to inflate node computing power might be counterproductive: too much inefficient data movement between the sockets
- In 2018, transistor scaling will allow many CPUs to be regrouped on one chip
  – 100 CPUs (each capable of 100 Gflop/s) on the same chip seems realistic
  – Translates into a 10 Tflop/s node (so 100,000 nodes required – similar node count to Sequoia)
- Computing chip integrates a massive NoC interconnecting the CPUs, the memory interface(s), the network interface(s) (and a few other components)

[Figure: computing chip containing CPUs, IOs, and MEM interfaces, with links to other nodes]
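The arithmetic behind these figures can be checked with a short back-of-the-envelope sketch; all inputs are the numbers stated on the slide, nothing is measured:

```python
# Back-of-the-envelope check of the compute figures above.

CPUS_PER_CHIP = 100           # CPUs regrouped on one computing chip
FLOPS_PER_CPU = 100e9         # 100 Gflop/s per CPU
TARGET_SYSTEM_FLOPS = 1e18    # exascale target: 1 Eflop/s

node_flops = CPUS_PER_CHIP * FLOPS_PER_CPU       # 10 Tflop/s per node
nodes_needed = TARGET_SYSTEM_FLOPS / node_flops  # 100,000 nodes

print(f"node: {node_flops / 1e12:.0f} Tflop/s")
print(f"nodes for 1 Eflop/s: {nodes_needed:,.0f}")
```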

Rev PA1 3 Compute node "off-chip" architecture

- Desirable memory bandwidth: 0.5 byte per flop (can live with 0.25 byte per flop at first)
  – Memory bandwidth: 2.5 TB/s – 5 TB/s (20 Tb/s – 40 Tb/s)
- Desirable memory capacity: 0.5 bytes/flop DRAM (+ 2 – 5 bytes/flop non-volatile)
  – DRAM memory size: 5 TB → 320 HMCs with 16 GB (or 40 HMCs with 128 GB*)
  – NV memory size: 20 – 50 TB
- Memory channels likely to show high utilization (~80%)
- Only several bytes per transaction

[Figure: computing chip with cache (size depends on memory parameters), linked at 20 Tb/s – 40 Tb/s to a memory system of HMCs plus NV storage, 25 – 55 TB in total (5 TB DRAM + 20 – 50 TB NV)]

* 128 GB HMCs is a possible evolution of the technology
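The bandwidth and capacity figures follow directly from the bytes-per-flop ratios. A quick sketch (HMC counts are computed with binary gigabytes, which reproduces the slide's 320 and 40 figures):

```python
NODE_FLOPS = 10e12                 # 10 Tflop/s node (previous slide)

# Bandwidth: 0.25 - 0.5 byte per flop
bw_bytes_low  = 0.25 * NODE_FLOPS  # 2.5 TB/s
bw_bytes_high = 0.5  * NODE_FLOPS  # 5 TB/s
bw_bits_high  = bw_bytes_high * 8  # 40 Tb/s

# DRAM capacity: 0.5 byte per flop/s -> 5 TB, provided by HMC parts
dram_gib = 5 * 1024                # 5 TiB expressed in GiB
hmcs_16gb  = dram_gib // 16        # 320 HMCs with 16 GB parts
hmcs_128gb = dram_gib // 128       # 40 HMCs with 128 GB parts

print(bw_bytes_high / 1e12, "TB/s =", bw_bits_high / 1e12, "Tb/s")
print(hmcs_16gb, "x 16 GB HMCs or", hmcs_128gb, "x 128 GB HMCs")
```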

Rev PA1 4 HMC-based memory system

- The HMCs form a network
- Does the computing chip / HMC link support 20 Tb/s?
  – If yes, the maximal RAM capacity is determined by the maximal chaining depth
  – Currently 8 → 128 GB (or 1 TB with bigger HMCs) → insufficient! → multiple "access lanes" required
- Fan-out limited by the pin count (for electrical links)
- Fan-out also limited by the internal NoC and computing-node architecture (too many interfaces consume power and occupy area)

[Figure: two layouts – a computing chip with a single deep HMC chain (maximum depth?) versus a computing chip fanning out to several HMCs (maximum fan-out?); the first segment carries the heaviest load]
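Why depth-8 chaining is insufficient, and how many access lanes would then be needed, can be sketched as follows. The lane counts are derived here for illustration; the slide itself only states that multiple lanes are required:

```python
MAX_CHAIN_DEPTH = 8       # HMCs that can currently be chained on one link
HMC_GB = 16               # today's part (128 GB is a possible evolution)
TARGET_GB = 5 * 1024      # 5 TiB DRAM target, in GiB

capacity_per_lane = MAX_CHAIN_DEPTH * HMC_GB       # 128 GB -> insufficient alone
lanes_16gb = -(-TARGET_GB // capacity_per_lane)    # ceiling division -> 40 lanes
lanes_128gb = -(-TARGET_GB // (MAX_CHAIN_DEPTH * 128))  # 5 lanes with 128 GB parts

print(f"one lane reaches {capacity_per_lane} GB")
print(f"lanes needed: {lanes_16gb} (16 GB parts), {lanes_128gb} (128 GB parts)")
```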

Rev PA1 5 Optical links between computing chip and HMCs

Candidate options for which link segments to make optical:
- First segment only
- First segments (with P2P links)
- First segments (bus type)
- All segments
- Dedicated
- Hybrid

[Figure: four computing chip / HMC topologies illustrating the options above]

The choice will generally depend on the cost of using multiple interfaces within the computing chip (2 – 5 is probably okay, but more is perhaps too much), on the traffic, and on the bandwidth. A nice space to explore…
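The interface-count trade-off can be explored with a tiny sweep. The interface counts below are illustrative assumptions; only the 40 Tb/s aggregate comes from the earlier bandwidth slide:

```python
AGGREGATE_TBPS = 40          # aggregate memory bandwidth target (slide 3)

# Hypothetical interface counts: more interfaces lower the per-link rate,
# but each interface consumes chip area and NoC resources.
for n_ifaces in (1, 2, 5, 10, 20):
    per_iface = AGGREGATE_TBPS / n_ifaces
    print(f"{n_ifaces:2d} interfaces -> {per_iface:5.1f} Tb/s per interface")
```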