Topic 8: Advances in Parallel Computer Architectures

Topic 8: Advances in Parallel Computer Architectures 2018/12/25 \course\cpeg323-05F\Topic-final-323.ppt

Reading List
- Slides: Topic8x

Why Study Parallel Architecture?
Role of a computer architect: to design and engineer the various levels of a computer system to maximize performance and programmability within the limits of technology and cost.
Parallelism:
- Provides an alternative to a faster clock for performance
- Applies at all levels of system design
- Is a fascinating perspective from which to view architecture
- Is increasingly central in information processing

Inevitability of Parallel Computing
- Application demands
- Technology trends
- Architecture trends
- Economics

Application Trends
- Demand for cycles fuels advances in hardware, and vice versa
- Range of performance demands
- Goal of applications in using parallel machines: speedup
- Productivity requirement
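Speedup is usually taken to mean the ratio of the execution time on one processor to the execution time on p processors for the same problem. A minimal sketch of that definition follows; the function name and the timings are illustrative assumptions, not figures from the slides.

```python
def speedup(time_one_processor: float, time_p_processors: float) -> float:
    """Fixed-problem speedup: T(1) / T(p)."""
    if time_p_processors <= 0:
        raise ValueError("parallel time must be positive")
    return time_one_processor / time_p_processors


# Hypothetical timings in seconds: 120 s sequentially, 20 s on 8 processors.
example = speedup(120.0, 20.0)
print(f"speedup on 8 processors: {example:.1f}x")
```

A speedup below the processor count, as in this made-up case, is the common outcome once communication and load imbalance are included.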

Summary of Application Trends
- The transition to parallel computing has occurred for scientific and engineering computing
- It is in rapid progress in commercial computing
- Desktops also use multithreaded programs, which are a lot like parallel programs
- Demand for improving throughput on sequential workloads
- Demand for productivity

Technology: A Closer Look
- Basic advance is decreasing feature size (λ)
  - Clock rate improves roughly in proportion to the improvement in λ
  - Number of transistors improves like λ² (or faster)
- Performance > 100x per decade; clock rate accounts for 10x, the rest comes from transistor count
- How to use more transistors?
  - Parallelism in processing
  - Locality in data access
  - Both need resources, so there is a tradeoff
[Figure labels: Proc, $, Interconnect]

Clock Frequency Growth Rate
- 30% per year

Transistor Count Growth Rate
- 1 billion transistors on a chip in the early 2000s
- Transistor count grows much faster than clock rate: 40% per year, an order of magnitude more contribution in 2 decades
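To see what the quoted annual rates mean over longer spans, the sketch below compounds the 30% (clock) and 40% (transistor count) figures over one and two decades. The helper name `growth_factor` is an illustrative assumption; only the two rates come from the slides.

```python
def growth_factor(annual_rate: float, years: int) -> float:
    """Cumulative growth after `years` at a fixed compound annual rate."""
    return (1.0 + annual_rate) ** years


CLOCK_RATE = 0.30       # 30% per year, as quoted for clock frequency
TRANSISTOR_RATE = 0.40  # 40% per year, as quoted for transistor count

for years in (10, 20):
    clock = growth_factor(CLOCK_RATE, years)
    transistors = growth_factor(TRANSISTOR_RATE, years)
    print(f"{years} years: clock x{clock:.1f}, "
          f"transistors x{transistors:.1f}, "
          f"ratio x{transistors / clock:.1f}")
```

Compounding 30% per year over a decade gives roughly 14x, the same order as the 10x clock-rate contribution quoted earlier.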

Similar Story for Storage
- Divergence between memory capacity and speed is more pronounced
- Larger memories are slower
- Need deeper cache hierarchies
- Parallelism and locality within memory systems
- Disks too: parallel disks plus caching

Moore’s Law and Headcount
Along with the number of transistors, the effort and headcount required to design a microprocessor have grown exponentially.

Architectural Trends
- Architecture: performance and capability
- Tradeoff between parallelism and locality
  - Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
- Understanding microprocessor architectural trends
- Four generations of architectural history: tube, transistor, IC, VLSI

Technology Progress Overview
- Processor speed improvement: 2x per year (since 85); 100x in last decade
- DRAM memory capacity: 2x in 2 years (since 96); 64x in last decade
- Disk capacity: 2x per year (since 97); 250x in last decade
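Each trend above is quoted twice, once as a doubling period and once as a per-decade factor. The sketch below converts between the two forms so the figures can be cross-checked; the function names are illustrative assumptions. As transcribed, the pairs do not all agree exactly (doubling every 2 years compounds to 32x per decade, for instance), so the slide's numbers are best read as rough.

```python
import math


def decade_factor(doubling_time_years: float, years: float = 10.0) -> float:
    """Total growth over `years` when capacity doubles every `doubling_time_years`."""
    return 2.0 ** (years / doubling_time_years)


def implied_doubling_time(total_factor: float, years: float = 10.0) -> float:
    """Doubling period (in years) implied by a total growth factor over `years`."""
    return years * math.log(2) / math.log(total_factor)


# Per-decade factors as quoted on the slide.
quoted = {"processor speed": 100.0, "DRAM capacity": 64.0, "disk capacity": 250.0}

for name, factor in quoted.items():
    print(f"{name}: {factor:.0f}x per decade implies doubling every "
          f"{implied_doubling_time(factor):.2f} years")
```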

Classes of Parallel Architecture for High Performance Computers (Courtesy of Thomas Sterling)
- Parallel Vector Processors (PVP)
  - NEC Earth Simulator, SX-6
  - Cray-1, 2, XMP, YMP, C90, T90, X1
  - Fujitsu 5000 series
- Massively Parallel Processors (MPP)
  - Intel Touchstone Delta & Paragon
  - TMC CM-5
  - IBM SP-2 & 3, Blue Gene/Light
  - Cray T3D, T3E, Red Storm/Strider
- Distributed Shared Memory (DSM)
  - SGI Origin
  - HP Superdome
- Single Instruction stream, Multiple Data stream (SIMD)
  - Goodyear MPP, MasPar 1 & 2, TMC CM-2
- Commodity Clusters
  - Beowulf-class PC/Linux clusters
  - Constellations
  - HP Compaq SC, Linux NetworX MCR

What Have We Learned in the Last Two Decades?
- Building a “good” general-purpose parallel machine is very hard!
- Proof by contradiction: so many companies went bankrupt in the past decade!

A Growth Factor of a Billion in Performance in a Single Lifetime (Courtesy of Thomas Sterling)
[Figure: timeline of machines plotted against a performance axis running from 1 to 10^15 operations per second, labeled One OPS, KiloOPS, MegaOPS, GigaOPS, TeraOPS, PetaOPS]
- 1823 Babbage Difference Engine
- 1943 Harvard Mark 1
- 1949 Edsac
- 1951 Univac 1
- 1959 IBM 7094
- 1964 CDC 6600
- 1976 Cray 1
- 1982 Cray XMP
- 1988 Cray YMP
- 1991 Intel Delta
- 1996 T3E
- 1997 ASCI Red
- 2001 Earth Simulator
- 2003 Cray X1

Applications Demands [Courtesy of Erik P. DeBenedictis 2004]
[Figure: system performance (100 Teraflops, 1 Petaflops, 10 Petaflops, 100 Petaflops, 1 Exaflops, 10 Exaflops, 100 Exaflops, 1 Zettaflops) plotted against year (2000, 2010, 2020), with a separate group marked "No schedule provided by source"]
Applications shown:
- Plasma fusion simulation [Jardin 03]
- Simulation of more complex biomolecular structures [HEC04]
- Compute as fast as the engineer can think [NASA 99]
- ×100, ×1000 [SCaLeS 03]
- Geodata Earth Station Range [NASA 02]
- Full global climate [Malone 03]
- Simulation of medium biomolecular structures (µs scale), simulation of large biomolecular structures (ms scale), protein folding
- Performance marks: 50 TFLOPS, 250 TFLOPS, 1 PFLOPS
References:
[Jardin 03] S.C. Jardin, “Plasma Science Contribution to the SCaLeS Report,” Princeton Plasma Physics Laboratory, PPPL-3879 UC-70, available on Internet.
[Malone 03] Robert C. Malone, John B. Drake, Philip W. Jones, Douglas A. Rotman, “High-End Computing in Climate Modeling,” contribution to SCaLeS report.
[NASA 99] R. T. Biedron, P. Mehrotra, M. L. Nelson, F. S. Preston, J. J. Rehder, J. L. Rogers, D. H. Rudy, J. Sobieski, and O. O. Storaasli, “Compute as Fast as the Engineers Can Think!” NASA/TM-1999-209715, available on Internet.
[NASA 02] NASA Goddard Space Flight Center, “Advanced Weather Prediction Technologies: NASA’s Contribution to the Operational Agencies,” available on Internet.
[SCaLeS 03] Workshop on the Science Case for Large-scale Simulation, June 24-25, proceedings on Internet at http://www.pnl.gov/scales/.
[DeBenedictis 04] Erik P. DeBenedictis, “Matching Supercomputing to Progress in Science,” July 2004. Presentation at Lawrence Berkeley National Laboratory, also published as Sandia National Laboratories SAND report SAND2004-3333P. Sandia technical reports are available by going to http://www.sandia.gov and accessing the technical library.
[HEC04] Federal Plan for High-End Computing, May 2004.

Multi-core Technology Is Becoming Mainstream
- IBM: Power, CELL; AMD: Opteron; Intel, RMI, ClearSpeed
- Unprecedented peak performance
- Significantly reduces hardware cost, with much lower power consumption and heat
- Greatly expands the spectrum of application domains
“It is likely that 2005 will be viewed as the year that parallelism came to the masses, with multiple vendors shipping dual/multi-core platforms into the mainstream consumer and enterprise markets.” - Intel Fellow Justin Rattner, IEEE PACT Keynote Speech (Sept 19, 2005)

IBM Power5 Multicore Chip
- Technology: 130nm lithography, Cu, SOI
- Dual processor core
- 8-way superscalar
- Simultaneous multithreaded (SMT) core
  - Up to 2 virtual processors per real processor
  - 24% area growth per core for SMT
  - Natural extension to POWER4 design
Courtesy of “Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor” by Ron Kalla, Balaram Sinharoy, and Joel Tendler of IBM Systems Group

Quad AMD Opteron™
[Figure: board block diagram. Labels in the diagram: four AMD Opteron™ processors (940 mPGA); 200-333MHz 9 byte Reg. DDR; 8-G DRAM; VGA PCI Graphics; AMD-8111™ I/O Hub; SSL Encryption; TCP/IP off load engine; Legacy PCI; FLASH; LPC; SIO; Management; 100 BaseT Management LAN; SPI 3.0 interface; USB 1.0, 2.0; AC97; UDMA133; 10/100 Ethernet; Modular Array ASIC; 10/100 Phy; GMII to OC-12 or 802.3 GigE NIC]

ARM MPCore Architecture
Courtesy of linuxdevice.com

ClearSpeed CSX600
- 250 MHz clock
- 96 high-performance processing elements
- 576 Kbytes PE memory
- 128 Kbytes on-chip scratchpad memory
- 25,000 MIPS
- 50 GFLOPS single or double precision
- 3.2 Gbytes/s external memory bandwidth
- 96 Gbytes/s internal memory bandwidth
- 2 x 4 Gbytes/s chip-to-chip bandwidth
Courtesy of CSX600 Overview on http://www.clearspeed.com/
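The chip-level figures are easier to read when divided down to a single processing element (PE). The sketch below does that arithmetic using only the numbers quoted above. It assumes decimal prefixes for bandwidth and FLOPS (1 G = 10^9) and an even split of memory and bandwidth across PEs, neither of which the slide states.

```python
# Figures quoted on the slide.
CLOCK_HZ = 250e6            # 250 MHz clock
NUM_PES = 96                # processing elements
PE_MEMORY_KB = 576          # total PE memory
PEAK_GFLOPS = 50.0          # peak floating-point rate
INTERNAL_BW_GB_S = 96.0     # internal memory bandwidth
EXTERNAL_BW_GB_S = 3.2      # external memory bandwidth

# Derived per-PE quantities (assumes an even split across PEs).
memory_per_pe_kb = PE_MEMORY_KB / NUM_PES
flops_per_pe_per_cycle = PEAK_GFLOPS * 1e9 / (NUM_PES * CLOCK_HZ)
internal_bytes_per_pe_per_cycle = INTERNAL_BW_GB_S * 1e9 / NUM_PES / CLOCK_HZ

# External bandwidth available per peak floating-point operation.
external_bytes_per_flop = EXTERNAL_BW_GB_S / PEAK_GFLOPS

print(f"PE memory: {memory_per_pe_kb:.0f} KB per PE")
print(f"peak: {flops_per_pe_per_cycle:.2f} flops per PE per cycle")
print(f"internal bandwidth: {internal_bytes_per_pe_per_cycle:.1f} bytes per PE per cycle")
print(f"external bandwidth: {external_bytes_per_flop:.3f} bytes per flop")
```

The gap between the internal and external figures (96 versus 3.2 Gbytes/s) is the locality tradeoff from earlier in the deck: peak rate is only reachable when operands stay in on-chip memory.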

A Case Study -- The IBM Cyclops-64 Architecture
Architect: Monty Denneau
[Figure: packaging hierarchy. Diagram labels: Thread Unit, FPU, SRAM, I-Cache, Processor, Intra-Chip Network, External Memory, Input, Output, Communication Ports for 3D Mesh Inter-Chip Network]
- “Processor”: 1 Gflop/s, 64 KB SRAM
- “Chip”: 80 Gflop/s, 1 GB memory
- “Board”: 320 Gflop/s, 4 GB memory
- “Rack”: 15.4 Tflop/s, 192 GB memory
- “System”: 1.1 Pflop/s, 13.5 TB memory
- Chip bisection BW: 4 TB/s
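The per-level figures imply how many units of each level make up the next. The sketch below infers those counts from the quoted peak rates and memory sizes and then rebuilds the system totals from the bottom up. The counts are inferences from the slide's numbers, not values stated on it, and the system-level count assumes 1 TB = 1024 GB.

```python
# Quoted on the slide.
GFLOPS_PER_PROCESSOR = 1.0
GB_PER_CHIP = 1

# Inferred from ratios between adjacent levels (assumptions, not slide text).
PROCESSORS_PER_CHIP = 80   # 80 Gflop/s / 1 Gflop/s
CHIPS_PER_BOARD = 4        # 4 GB / 1 GB, and 320 / 80 Gflop/s
BOARDS_PER_RACK = 48       # 192 GB / 4 GB
RACKS_PER_SYSTEM = 72      # 13.5 TB / 192 GB, taking 1 TB = 1024 GB

chip_gflops = GFLOPS_PER_PROCESSOR * PROCESSORS_PER_CHIP
board_gflops = chip_gflops * CHIPS_PER_BOARD
rack_gflops = board_gflops * BOARDS_PER_RACK
system_gflops = rack_gflops * RACKS_PER_SYSTEM

chips_in_system = CHIPS_PER_BOARD * BOARDS_PER_RACK * RACKS_PER_SYSTEM
system_memory_gb = GB_PER_CHIP * chips_in_system

print(f"chip:   {chip_gflops:.0f} Gflop/s")
print(f"board:  {board_gflops:.0f} Gflop/s")
print(f"rack:   {rack_gflops / 1e3:.2f} Tflop/s")
print(f"system: {system_gflops / 1e6:.3f} Pflop/s, "
      f"{system_memory_gb / 1024:.1f} TB in {chips_in_system} chips")
```

With these counts the rebuilt totals round to the slide's 15.4 Tflop/s per rack and 1.1 Pflop/s with 13.5 TB for the system.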

Data Points of a 1 Petaflop C64 Machine
- Cyclops chip: 533 MHz, 5.1 MB SRAM, 1-2 GB DRAM
- Disk space: 300 GB/node
- Total system power: 2 MW (chilled-water cooling)
- Size: 20’ x 48’
- Mean time to failure: 2 weeks
- Cost: 20 million ?
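A few derived quantities make these data points easier to compare with other machines. The sketch below computes peak performance per watt and floor area from the quoted figures, and checks the per-chip SRAM against the packaging hierarchy. The count of 80 processors per chip is an inference from the 80 Gflop/s chip and 1 Gflop/s processor figures, not a number stated here.

```python
# Quoted data points.
PEAK_FLOPS = 1e15           # 1 Petaflop machine
TOTAL_POWER_W = 2e6         # 2 MW
LENGTH_FT, WIDTH_FT = 48, 20

# From the packaging hierarchy; the processor count is inferred (assumption).
SRAM_PER_PROCESSOR_KB = 64
PROCESSORS_PER_CHIP = 80

flops_per_watt = PEAK_FLOPS / TOTAL_POWER_W
floor_area_sq_ft = LENGTH_FT * WIDTH_FT
sram_per_chip_kb = SRAM_PER_PROCESSOR_KB * PROCESSORS_PER_CHIP

print(f"peak efficiency: {flops_per_watt / 1e6:.0f} Mflop/s per watt")
print(f"floor area: {floor_area_sq_ft} sq ft")
print(f"SRAM per chip: {sram_per_chip_kb} KB")
```

The 5120 KB of SRAM per chip obtained this way is in line with the 5.1 MB quoted for the Cyclops chip.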

A Cyclops-64 Rack

C-64 Chip Architecture
- On-chip bisection BW = 0.38 TB/s; total BW to 6 neighbours = 48 GB/s

Mrs. Clops

Summary